# Apache Spark
Spark is a unified analytics engine for large-scale data processing. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
MLlib for machine learning, GraphX for graph processing,
and Structured Streaming for stream processing.
<https://spark.apache.org/>
[![Jenkins Build](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/badge/icon)](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3)
[![AppVeyor Build](https://img.shields.io/appveyor/ci/ApacheSoftwareFoundation/spark/master.svg?style=plastic&logo=appveyor)](https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark)
[![PySpark Coverage](https://img.shields.io/badge/dynamic/xml.svg?label=pyspark%20coverage&url=https%3A%2F%2Fspark-test.github.io%2Fpyspark-coverage-site&query=%2Fhtml%2Fbody%2Fdiv%5B1%5D%2Fdiv%2Fh1%2Fspan&colorB=brightgreen&style=plastic)](https://spark-test.github.io/pyspark-coverage-site)
## Online Documentation
You can find the latest Spark documentation, including a programming
guide, on the [project web page](https://spark.apache.org/documentation.html).
This README file only contains basic setup instructions.
## Building Spark
Spark is built using [Apache Maven](https://maven.apache.org/).
To build Spark and its example programs, run:

    ./build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)
More detailed documentation is available from the project site, at
["Building Spark"](https://spark.apache.org/docs/latest/building-spark.html).
For general development tips, including info on developing Spark using an IDE, see ["Useful Developer Tools"](https://spark.apache.org/developer-tools.html).
## Interactive Scala Shell
The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell

Try the following command, which should return 1,000,000,000:

    scala> spark.range(1000 * 1000 * 1000).count()

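For a slightly bigger taste of the API, the same shell session can build a small DataFrame and aggregate it. This is just an illustrative snippet using the `spark` session the shell creates for you:

    scala> val df = spark.range(10).selectExpr("id", "id % 2 AS parity")
    scala> df.groupBy("parity").count().show()
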
## Interactive Python Shell
Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark

And run the following command, which should also return 1,000,000,000:

    >>> spark.range(1000 * 1000 * 1000).count()

## Example Programs
Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> [params]`. For example:

    ./bin/run-example SparkPi

will run the Pi example locally.
You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn" to run on YARN, "local" to run locally with one thread,
or "local[N]" to run locally with N threads. You
can also use an abbreviated class name if the class is in the `examples`
package. For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.
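The master URL can also be set programmatically when you create a `SparkSession` in your own application. Below is a minimal sketch; the app name and thread count are arbitrary placeholders:

    import org.apache.spark.sql.SparkSession

    // "local[4]" runs locally with 4 threads; swap in "spark://host:7077",
    // "yarn", or a mesos:// URL to target a cluster instead.
    val spark = SparkSession.builder()
      .appName("MasterUrlExample")   // hypothetical application name
      .master("local[4]")
      .getOrCreate()

    println(spark.range(1000).count())  // prints 1000
    spark.stop()
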
## Running Tests
Testing first requires [building Spark](#building-spark). Once Spark is built, tests
can be run using:

    ./dev/run-tests

Please see the guidance on how to
[run tests for a module, or individual tests](https://spark.apache.org/developer-tools.html#individual-tests).
There is also a Kubernetes integration test; see `resource-managers/kubernetes/integration-tests/README.md`.
## A Note About Hadoop Versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
Please refer to the build documentation at
["Specifying the Hadoop Version and Enabling YARN"](https://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version-and-enabling-yarn)
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.
## Configuration
Please refer to the [Configuration Guide](https://spark.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.
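As an illustration (not a substitute for the guide), common settings can be passed when building a `SparkSession`, and runtime-adjustable SQL options can be changed on a live session. The property names below are standard Spark configuration keys, while the values are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ConfigExample")                       // hypothetical application name
      .master("local[*]")
      .config("spark.executor.memory", "4g")          // example value only
      .config("spark.sql.shuffle.partitions", "200")  // matches the default
      .getOrCreate()

    // SQL options marked as runtime-configurable can be changed later.
    spark.conf.set("spark.sql.shuffle.partitions", "64")
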
## Contributing
Please review the [Contribution to Spark guide](https://spark.apache.org/contributing.html)
for information on how to get started contributing to the project.
## spark-3.0.0-bin-hadoop3.2
Spark is an open-source big data processing framework from the Apache Software Foundation, known for its efficiency, ease of use, and scalability. The topic here is Spark 3.0.0 in its binary distribution built against Hadoop 3.2, "spark-3.0.0-bin-hadoop3.2". This archive can also be used to run Spark on the Windows platform.
Spark 3.0.0 was a significant milestone in Spark's development, introducing many new features and performance optimizations. Some key points:
1. **Adaptive Query Execution and Dynamic Partition Pruning**: Spark 3.0.0 introduces adaptive query execution (AQE), which re-optimizes query plans at runtime using shuffle statistics, and dynamic partition pruning, both aimed at speeding up large-scale data processing (see the sketch after this list).
2. **SQL enhancements**: Spark SQL received major improvements, including better Hive metastore support, refinements to the DataFrame API, and broader coverage of standard SQL syntax, which makes data analysis more convenient.
3. **Performance**: the shuffle path was optimized to reduce data transfer and disk I/O, and execution continues to be accelerated by Tungsten and whole-stage code generation.
4. **PySpark improvements**: the Python API (PySpark) was enhanced in this release, with better support for Python data types and usability improvements that raise productivity for Python users.
5. **Memory management**: Spark's unified memory management model aims to use memory more effectively and to reduce serialization and deserialization overhead.
6. **Native Kubernetes support**: Spark 3.0.0 strengthens native Kubernetes support, making it easier to deploy and manage Spark jobs on Kubernetes clusters.
7. **Security**: stronger security features such as encrypted communication, authentication, and authorization help keep data safe while it is being processed.
8. **Hadoop 3.2 compatibility**: this build is compiled against Hadoop 3.2, so it can take advantage of newer Hadoop features such as YARN resource-scheduling improvements and HDFS enhancements.
9. **MLlib**: the machine learning library was also updated in 3.0.0, with additional algorithms and better model interpretability and reproducibility.
10. **GraphX**: for graph computation, GraphX provides a set of APIs for processing and analyzing graph data; the 3.0.0 release may include further optimizations and enhancements.
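For example, adaptive query execution is controlled by an ordinary configuration flag (`spark.sql.adaptive.enabled`, which is off by default in 3.0.0). A minimal, hypothetical Scala session that enables it could look like this:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("AqeExample")                          // hypothetical application name
      .master("local[*]")
      .config("spark.sql.adaptive.enabled", "true")   // off by default in 3.0.0
      .getOrCreate()

    // Joins and aggregations run through this session can now be
    // re-optimized at runtime based on observed shuffle statistics.
    val joined = spark.range(1000000).toDF("id")
      .join(spark.range(1000).toDF("id"), "id")
    println(joined.count())
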
After extracting "spark-3.0.0-bin-hadoop3.2" you will find everything needed to run Spark: the executable scripts under `bin`, the library JARs under `jars`, and the configuration files under `conf`. On Windows, you can edit the configuration files, set the required environment variables, and use the provided launch scripts to run tools such as the Spark shell and `spark-submit` to start processing data.
To get the most out of Spark, you need to understand how to configure its runtime environment, including setting up master and worker nodes and allocating memory and CPU resources, and how to write Spark programs. Familiarity with other components of the Hadoop ecosystem, such as HDFS and YARN, will also help you integrate and manage Spark jobs.
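To make the programming side concrete, here is a minimal, self-contained word-count application in Scala. It assumes a local master and an arbitrary app name, and it is a sketch rather than something shipped in the distribution:

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Local master for experimentation; point this at your cluster
        // (e.g. "spark://host:7077" or "yarn") for real workloads.
        val spark = SparkSession.builder()
          .appName("WordCount")        // hypothetical application name
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        val lines = Seq("spark makes big data simple",
                        "big data with spark").toDS()
        val counts = lines
          .flatMap(_.split("\\s+"))    // Dataset[String]; its single column is named "value"
          .groupBy("value")
          .count()

        counts.show()
        spark.stop()
      }
    }
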
In short, spark-3.0.0-bin-hadoop3.2 is a powerful and flexible big data processing distribution that also runs on Windows, giving developers efficient data processing and analysis capabilities. With study and practice, you can use it to solve a wide range of big data problems and carry out complex analysis tasks.