spark-operator resource

217 files in total: 135 go, 37 yaml, 14 md

Uploaded 2024-06-18 19:13:39, 484KB ZIP

"Spark-Operator"是专为Kubernetes集群设计的一个工具，用于在Kubernetes上管理和运行Apache Spark作业。这个工具的核心目标是简化在Kubernetes环境中部署、监控和管理Spark应用的过程，使得用户无需直接与低级别的Kubernetes API交互，而是通过定义特定的CRD（Custom Resource Definition）来启动和管理Spark作业。在Kubernetes生态系统中，Operator是一种扩展API的方式，它允许开发者定义和操作复杂的应用程序和服务。Spark-Operator就是这样的一个Operator，它提供了对Apache Spark作业的声明式管理，类似于Kubernetes处理Pods和Services的方式。 1. **Spark-Operator的主要功能**： - **自动创建资源**：用户可以定义一个SparkApplication资源，Operator会根据这个定义创建相应的Pods、Services和其他必要的Kubernetes资源。 - **动态调度**：Operator可以根据资源需求和集群状态动态调度Spark作业。 - **状态跟踪和故障恢复**：它监控Spark作业的生命周期，当作业失败时，可以配置它进行重试或通知用户。 - **资源管理**：Operator可以管理Spark作业的资源请求和限制，确保作业在集群中的公平分配。 - **日志和指标**：集成Kubernetes的日志和监控系统，提供Spark作业的运行时信息。 2. **使用Spark-Operator的流程**： - **安装Operator**：需要将`spark-operator-1beta2-1.3.4-3.1.1`这样的打包文件解压，并按照官方文档在Kubernetes集群中部署Operator。 - **定义SparkApplication**：编写YAML文件，定义Spark应用的配置，如主类、JAR文件位置、主资源需求等。 - **应用定义**：使用`kubectl apply`命令将SparkApplication资源配置到集群中。 - **监控作业**：通过Kubernetes的命令行工具或者dashboard查看作业状态，包括启动、运行和完成情况。 3. **版本信息**： "spark-operator-1beta2-1.3.4-3.1.1"中的数字部分表示该Operator的版本，1beta2可能代表其处于Beta阶段的第二个版本，1.3.4是Spark版本号，3.1.1可能是Kubernetes的兼容版本。每个版本可能会修复已知问题，增加新特性，或提高与特定Spark和Kubernetes版本的兼容性。 4. **最佳实践**： - **资源规划**：合理设置Spark作业的资源请求和限制，避免资源浪费和作业竞争。 - **安全配置**：考虑使用ServiceAccount和RoleBinding进行权限控制，保护Spark作业和集群的安全。 - **监控和告警**：配置Operator的监控和告警规则，以便在作业异常时及时收到通知。 Spark-Operator是Kubernetes环境中的一个强大工具，它使得在云原生环境中运行Apache Spark变得更加便捷和可靠。通过了解并熟练掌握Spark-Operator，用户可以更好地利用Kubernetes的灵活性和弹性来管理和运行大数据处理任务。


spark-operator (217 files)

binary.dat 6B

Dockerfile 2KB

Dockerfile 1KB

Dockerfile 813B

.dockerignore 7B

.gitignore 149B

patch_test.go 54KB

controller_test.go 50KB

controller.go 41KB

types.go 34KB

zz_generated.deepcopy.go 28KB

types.go 25KB

patch.go 24KB

controller_test.go 23KB

zz_generated.deepcopy.go 22KB

webhook.go 20KB

submission.go 19KB

sparkui_test.go 19KB

submission_test.go 17KB

constants.go 17KB

create.go 14KB

controller.go 13KB

sparkapp_metrics.go 13KB

main.go 13KB

sparkui.go 11KB

monitoring_config_test.go 10KB

volcano_scheduler.go 10KB

scheduledsparkapplication.go 8KB

util.go 8KB

sparkapplication.go 7KB

scheduledsparkapplication.go 7KB

metrics.go 7KB

framework.go 7KB

sparkapp_util.go 7KB

spark_pod_eventhandler_test.go 7KB

sparkapplication.go 7KB

watcher.go 6KB

factory.go 6KB

fake_scheduledsparkapplication.go 6KB

webhook_test.go 6KB

fake_sparkapplication.go 6KB

fake_scheduledsparkapplication.go 6KB

defaults_test.go 6KB

monitoring_config.go 6KB

event.go 5KB

fake_sparkapplication.go 5KB

forward.go 5KB

create_test.go 5KB

handlers.go 4KB

volcano_scheduler_test.go 4KB

scheduledsparkapplication.go 4KB

enforcer.go 4KB

sparkapplication.go 4KB

clientset.go 4KB

sparkapplication.go 4KB

helpers.go 4KB

basic_test.go 4KB

secret_test.go 4KB

log.go 4KB

clientset_generated.go 3KB

service.go 3KB

lifecycle_test.go 3KB

spark_pod_eventhandler.go 3KB

sparkapp_metrics_test.go 3KB

sparkoperator.k8s.io_client.go 3KB

generic.go 3KB

volume_mount_test.go 3KB

status.go 3KB

cluster_role_binding.go 3KB

main_test.go 3KB

role_binding.go 3KB

secret.go 3KB

sparkapplication.go 3KB

job.go 3KB

certs.go 3KB

cluster_role.go 2KB

deployment.go 2KB

defaults.go 2KB

service_account.go 2KB

defaults.go 2KB

role.go 2KB

config.go 2KB

scheduler_manager.go 2KB

client.go 2KB

interface.go 2KB

config_test.go 2KB

capabilities.go 2KB

gcs.go 2KB

doc.go 2KB

util.go 2KB


# sparkctl

`sparkctl` is a command-line tool of the Spark Operator for creating, listing, checking status of, getting logs of, and deleting `SparkApplication`s. It can also do port forwarding from a local port to the Spark web UI port for accessing the Spark web UI on the driver. Each function is implemented as a sub-command of `sparkctl`.

To build `sparkctl`, make sure you have followed the build steps [here](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/developer-guide.md#build-the-operator) and have all the dependencies, then run the following command from within `sparkctl/`:

```bash
$ go build -o sparkctl
```

## Flags

The following global flags are available for all the sub-commands:
* `--namespace`: the Kubernetes namespace of the `SparkApplication`(s). Defaults to `default`.
* `--kubeconfig`: the path to the file storing configuration for accessing the Kubernetes API server. Defaults to `$HOME/.kube/config`.

## Available Commands

### Create

`create` is a sub-command of `sparkctl` for creating a `SparkApplication` object. There are two ways to create a `SparkApplication` object. One is to create it from a given YAML file in the namespace specified by `--namespace`. In this way, `create` parses the YAML file and sends the parsed `SparkApplication` object to the Kubernetes API server.

Usage:
```bash
$ sparkctl create <path to YAML file>
```

The other way is creating a `SparkApplication` object from a named `ScheduledSparkApplication` to manually force a run of the `ScheduledSparkApplication`.

Usage:
```bash
$ sparkctl create <name of the SparkApplication> --from <name of the ScheduledSparkApplication>
```

The `create` command also supports shipping local Hadoop configuration files into the driver and executor pods.
Specifically, it detects local Hadoop configuration files located at the path specified by the environment variable `HADOOP_CONF_DIR`, creates a Kubernetes `ConfigMap` from the files, and adds the `ConfigMap` to the `SparkApplication` object so it gets mounted into the driver and executor pods by the operator. The environment variable `HADOOP_CONF_DIR` is also set in the driver and executor containers.

#### Staging local dependencies

The `create` command also supports staging local application dependencies; currently, uploading to a Google Cloud Storage (GCS) bucket or to an S3 bucket is supported. The way it works is as follows. It checks if there are any local dependencies in `spec.mainApplicationFile`, `spec.deps.jars`, `spec.deps.files`, etc. in the parsed `SparkApplication` object. If so, it tries to upload the local dependencies to the remote location specified by `--upload-to`. The command fails if local dependencies are used but `--upload-to` is not specified.

By default, a local file that already exists remotely, i.e., there exists a file with the same name and upload path remotely, will be ignored. If the remote file should be overridden instead, the `--override` flag should be specified.

##### Uploading to GCS

For uploading to GCS, the value of `--upload-to` should be in the form of `gs://<bucket>`. The bucket must exist; uploading fails otherwise. The local dependencies will be uploaded to the path `spark-app-dependencies/<SparkApplication namespace>/<SparkApplication name>` in the given bucket. If uploading is successful, the file path of each local dependency in the parsed `SparkApplication` object is replaced with the URI of the remote copy.

Note that uploading to GCS requires a GCP service account with the necessary IAM permission to use the GCP project specified by the service account JSON key file (`serviceusage.services.use`) and the permission to create GCS objects (`storage.objects.create`).
The service account JSON key file must be locally available and be pointed to by the environment variable `GOOGLE_APPLICATION_CREDENTIALS`. For more information on IAM authentication, please check [Getting Started with Authentication](https://cloud.google.com/docs/authentication/getting-started).

Usage:
```bash
$ export GOOGLE_APPLICATION_CREDENTIALS="[PATH]/[FILE_NAME].json"
$ sparkctl create <path to YAML file> --upload-to gs://<bucket>
```

By default, the uploaded dependencies are not made publicly accessible and are referenced using URIs in the form of `gs://bucket/path/to/file`. To download the dependencies from GCS, a custom-built Spark init-container with the [GCS connector](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage) installed and the necessary Hadoop configuration properties specified is needed. An example Dockerfile of such an init-container can be found [here](https://gist.github.com/liyinan926/f9e81f7b54d94c05171a663345eb58bf).

If you want to make uploaded dependencies publicly available so they can be downloaded by the built-in init-container, add `--public` to the `create` command, as the following example shows:

```bash
$ sparkctl create <path to YAML file> --upload-to gs://<bucket> --public
```

Publicly available files are referenced through URIs of the form `https://storage.googleapis.com/bucket/path/to/file`.

##### Uploading to S3

For uploading to S3, the value of `--upload-to` should be in the form of `s3://<bucket>`. The bucket must exist; uploading fails otherwise. The local dependencies will be uploaded to the path `spark-app-dependencies/<SparkApplication namespace>/<SparkApplication name>` in the given bucket. If uploading is successful, the file path of each local dependency in the parsed `SparkApplication` object is replaced with the URI of the remote copy.
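
As a rough illustration of the upload path layout described for both GCS and S3, the sketch below derives the remote URI for one local dependency. `remote_dependency_uri` is a hypothetical helper and not part of `sparkctl`; it assumes the file keeps its base name under the per-application prefix and that non-public S3 uploads are referenced with the `s3a://` scheme.

```bash
# Hypothetical helper (not part of sparkctl): build the remote URI that a local
# dependency would map to under the documented upload path layout.
#   $1 = --upload-to value, e.g. gs://my-bucket or s3://my-bucket
#   $2 = SparkApplication namespace
#   $3 = SparkApplication name
#   $4 = local file path
remote_dependency_uri() {
  local upload_to="${1%/}"   # drop a trailing slash, if any
  local namespace="$2"
  local app_name="$3"
  local file_name
  file_name="$(basename "$4")"

  # Assumption: S3 uploads are referenced through the s3a:// scheme.
  case "$upload_to" in
    s3://*) upload_to="s3a://${upload_to#s3://}" ;;
  esac

  printf '%s/spark-app-dependencies/%s/%s/%s\n' \
    "$upload_to" "$namespace" "$app_name" "$file_name"
}

remote_dependency_uri gs://my-bucket default spark-pi /tmp/jars/spark-pi.jar
# prints gs://my-bucket/spark-app-dependencies/default/spark-pi/spark-pi.jar
```

The bucket, namespace, and file names in the call are made up for the example; with `--public` on GCS the reference would instead be an `https://storage.googleapis.com/...` URI, which this sketch does not cover.
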

Note that uploading to S3 with the [AWS SDK](https://docs.aws.amazon.com/sdk-for-go/v1/developer-guide/configuring-sdk.html) requires credentials to be specified. For GCP, the S3 Interoperability credentials can be retrieved as described [here](https://cloud.google.com/storage/docs/migrating#keys). The SDK uses the default credential provider chain to find AWS credentials, and it uses the first provider in the chain that returns credentials without an error. The default provider chain looks for credentials in the following order:

- Environment variables
  ```
  AWS_ACCESS_KEY_ID
  AWS_SECRET_ACCESS_KEY
  ```
- Shared credentials file (`.aws/credentials`)

For more information about AWS SDK authentication, please check [Specifying Credentials](https://docs.aws.amazon.com/sdk-for-go/v1/developer-guide/configuring-sdk.html#specifying-credentials).

Usage:
```bash
$ export AWS_ACCESS_KEY_ID=[KEY]
$ export AWS_SECRET_ACCESS_KEY=[SECRET]
$ sparkctl create <path to YAML file> --upload-to s3://<bucket>
```

By default, the uploaded dependencies are not made publicly accessible and are referenced using URIs in the form of `s3a://bucket/path/to/file`. To download the dependencies from S3, a custom-built Spark Docker image is needed with the jars required by the S3A connector on the classpath (`hadoop-aws-2.7.6.jar` and `aws-java-sdk-1.7.6.jar` for a Spark build with the Hadoop 2.7 profile, or `hadoop-aws-3.1.0.jar` and `aws-java-sdk-bundle-1.11.271.jar` for Hadoop 3.1), and `spark-defaults.conf` needs to set the AWS keys and the S3A `FileSystem` class (you can also use `spec.hadoopConf` in the `SparkApplication` YAML):

```
spark.hadoop.fs.s3a.endpoint https://storage.googleapis.com
spark.hadoop.fs.s3a.access.key [KEY]
spark.hadoop.fs.s3a.secret.key [SECRET]
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
```

NOTE: In Spark 2.3, init-containers are used for downloading remote application dependencies. In later versions, init-containers are removed.
It is recommended to use Apache Spark 2.4 for staging local dependencies with `s3`, which currently requires building a custom Docker image from the Spark master branch. Additionally, since Spark 2.4.0 there are two available bui
