Trends in Large Data Sets (bp4d dataset resources) - CSDN Library


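### Trend Analysis in Large Data Sets: Identifying Novel Associations with R

In the era of big data, processing and analyzing massive data sets is one of the major challenges facing companies and research institutions. How to extract valuable information and trends from such data has become a core question in data analysis. This article takes "trends in large data sets" as its theme and uses R-based practice to discuss how to handle large data and how to identify novel associations in it.

#### 1. R and its use in data analysis

R is an open-source programming language for statistical computing and graphics. It is widely used in data science because of its data-handling capabilities and its large collection of statistical packages. For large data sets, R offers several tools and techniques for improving efficiency and performance.

#### 2. Approaches to handling large data sets

The main challenges with large data sets are memory limits and computational efficiency. The following strategies address them:

- **Chunked processing**: split the data into smaller blocks and process them one at a time, so the whole data set is never loaded into memory at once.
- **Parallel computing**: use multi-core processors or a distributed environment (such as R's `parallel` package, or Spark) to speed up computation.
- **Optimized data structures**: store and manipulate data with more efficient structures (such as the `data.table` package).

#### 3. Identifying novel associations in large data sets

Researchers have proposed many methods for identifying potential associations between variables in large data sets. One notable technique is the maximal information coefficient (MIC), a measure that captures a wide range of associations, both functional and non-functional.

- **How MIC works**: MIC measures the strength of association between two variables. It searches over grids drawn on the scatterplot of the two variables, up to a maximal resolution that depends on the sample size, records the largest mutual information achievable at each grid resolution, normalizes these values to lie between 0 and 1, and reports the maximum. This lets it detect linear, nonlinear, and more complex association patterns.
- **Applications of MIC**: MIC has been applied to large-scale data analysis in fields such as biology and medicine. In genetics research, for example, MIC can be used to explore associations between gene expression levels and particular diseases.
- **Implementing MIC**: in R, MIC can be computed with a package such as `minerva`, which provides the functions needed for the calculation.

#### 4. Practical example: detecting associations in a large data set with R

Suppose we have a large data set of medical records and want to identify which factors are significantly associated with the incidence of heart disease. The work can proceed as follows:

1. **Data preprocessing**: clean the data, handling missing values and outliers, to ensure data quality.
2. **Feature selection**: use domain knowledge and preliminary analysis to select variables that may be related to heart disease.
3. **Applying MIC**: compute MIC for the selected variables with an R package such as `minerva` to identify the strongest associations.
4. **Interpreting the results**: examine the variable pairs with high MIC values and investigate how those variables relate to the incidence of heart disease.

#### 5. Conclusion

Even for very large data sets, effective data-processing techniques combined with statistical methods such as MIC make it possible to find valuable information and trends. Using R to process large data sets and to search them for meaningful trends and associations is an important skill for meeting the challenges of the big-data era.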
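The chunked-processing strategy from section 2 can be sketched as follows. This is a minimal illustration written in Python rather than R; the function names `iter_chunks` and `chunked_mean_variance` are hypothetical, and the running-sums formula for the variance is a simple choice that can lose precision on data with a very large mean.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def iter_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most chunk_size items from any iterable."""
    it = iter(values)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean_variance(values: Iterable[float], chunk_size: int = 1000) -> Tuple[float, float]:
    """Compute mean and population variance while holding one chunk in memory."""
    count = 0
    total = 0.0
    total_sq = 0.0
    for chunk in iter_chunks(values, chunk_size):
        count += len(chunk)
        total += sum(chunk)
        total_sq += sum(v * v for v in chunk)
    if count == 0:
        raise ValueError("no data")
    mean = total / count
    # Population variance from running sums: E[v^2] - (E[v])^2.
    variance = total_sq / count - mean * mean
    return mean, variance


if __name__ == "__main__":
    # The input can be a generator, e.g. values parsed line by line from a file.
    print(chunked_mean_variance((float(i) for i in range(1, 101)), chunk_size=7))
```

Because only counts and sums are carried between chunks, the same pattern extends to any statistic that can be accumulated incrementally.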


David N. Reshef et al., Detecting Novel Associations in Large Data Sets. Science 334, 1518 (2011). DOI: 10.1126/science.1205438


Downloaded from www.sciencemag.org on December 15, 2011

Detecting Novel Associations in Large Data Sets

David N. Reshef,¹,²,³*† Yakir A. Reshef,²,⁴*† Hilary K. Finucane,⁵ Sharon R. Grossman,²,⁶ Gilean McVean,³,⁷ Peter J. Turnbaugh,⁶ Eric S. Lander,²,⁸,⁹ Michael Mitzenmacher,¹⁰‡ Pardis C. Sabeti²,⁶‡

Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination ($R^2$) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.

Imagine a data set with hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs, far too many to examine manually. If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones? Data sets of this size are increasingly common in fields as varied as genomics, physics, political science, and economics, making this question an important and growing challenge (1, 2).

One way to begin exploring a large data set is to search for pairs of variables that are closely associated. To do this, we could calculate some measure of dependence for each pair, rank the pairs by their scores, and examine the top-scoring pairs. For this strategy to work, the statistic we use to measure dependence should have two heuristic properties: generality and equitability.
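The score-and-rank strategy described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the absolute Pearson correlation is used as a stand-in score, not as the statistic the article proposes. Any function of two columns can be passed as `score`.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple


def abs_pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Absolute Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy / math.sqrt(sxx * syy))


def rank_pairs(
    columns: Dict[str, Sequence[float]],
    score: Callable[[Sequence[float], Sequence[float]], float] = abs_pearson,
) -> List[Tuple[str, str, float]]:
    """Score every unordered pair of columns and sort from strongest to weakest."""
    scored = [
        (a, b, score(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(scored, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    data = {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10],
        "c": [5, 1, 4, 2, 3],
    }
    for name_a, name_b, value in rank_pairs(data):
        print(name_a, name_b, round(value, 3))
```

With hundreds of variables the number of pairs grows quadratically, which is why the choice of score matters more than the ranking loop itself.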

By generality, we mean that with sufficient sample size the statistic should capture a wide range of interesting associations, not limited to specific function types (such as linear, exponential, or periodic), or even to all functional relationships (3). The latter condition is desirable because not only do relationships take many functional forms, but many important relationships (for example, a superposition of functions) are not well modeled by a function (4–7).
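To make the need for generality concrete, the sketch below evaluates the Pearson correlation, chosen here only as a familiar example of a non-general statistic, on three noiseless data sets: a line, a parabola, and a superposition of the two lines y = x and y = -x. The data and the helper name `pearson` are illustrative assumptions, not taken from the article.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# 101 evenly spaced points on [-1, 1], symmetric around zero.
grid: List[float] = [(i - 50) / 50 for i in range(101)]

# A noiseless line: Pearson correlation is essentially 1.
linear = pearson(grid, [2.0 * x + 1.0 for x in grid])

# A noiseless parabola: a perfect functional relationship, yet correlation is essentially 0.
parabola = pearson(grid, [x * x for x in grid])

# Superposition of y = x and y = -x: not a function of x, and correlation is essentially 0.
superposition = pearson(grid + grid, grid + [-x for x in grid])

if __name__ == "__main__":
    print(f"linear={linear:.3f} parabola={parabola:.3f} superposition={superposition:.3f}")
```

Both the parabola and the two-line superposition are completely structured, so a statistic that scores them near zero would push them to the bottom of a ranked list.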

By equitability, we mean that the statistic should give similar scores to equally noisy relationships of different types. For example, we do not want noisy linear relationships to drive strong sinusoidal relationships from the top of the list. Equitability is difficult to formalize for associations in general but has a clear interpretation in the basic case of functional relationships: An equitable statistic should give similar scores to functional relationships with similar $R^2$ values (given sufficient sample size).
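For reference, the coefficient of determination of data $(x_i, y_i)$ relative to a function $f$ is commonly defined as below. The excerpt does not spell out the formula, so this standard form, with $\bar{y}$ the sample mean of the $y_i$, is an assumption about the intended definition.

```latex
R^2 \;=\; 1 \;-\; \frac{\sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2}{\sum_{i=1}^{n} \bigl(y_i - \bar{y}\bigr)^2},
\qquad
\bar{y} \;=\; \frac{1}{n}\sum_{i=1}^{n} y_i .
```

Under this reading, an equitable statistic would assign roughly the same score to a noisy line and a noisy sinusoid whenever the two have the same $R^2$ relative to their own noiseless functions.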

Here, we describe an exploratory data analysis tool, the maximal information coefficient (MIC), that satisfies these two heuristic properties. We establish MIC’s generality through proofs, show its equitability on functional relationships through simulations, and observe that this translates into intuitively equitable behavior on more general associations. Furthermore, we illustrate that MIC gives rise to a larger family of statistics, which we refer to as MINE, or maximal information-based nonparametric exploration. MINE statistics can be used not only to identify interesting associations, but also to characterize them according to properties such as nonlinearity and monotonicity. We demonstrate the application of MIC and MINE to data sets in health, baseball, genomics, and the human microbiota.

The maximal information coefficient. Intuitively, MIC is based on the idea that if a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship. Thus, to calculate the MIC of a set of two-variable data, we explore all grids up to a maximal grid resolution, dependent on the sample size (Fig. 1A), computing for every pair of integers (x,y) the largest possible mutual information achievable by any x-by-y grid applied to the data. We then normalize these mutual information values to ensure a fair comparison between grids of different dimensions and to obtain modified values between 0 and 1. We define the characteristic matrix $M = (m_{x,y})$, where $m_{x,y}$ is the highest normalized mutual information achieved by any x-by-y grid, and the statistic MIC to be the maximum value in $M$ (Fig. 1, B and C).
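The grid-based idea can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: instead of searching all x-by-y grids, it tries only equal-frequency grids, so its result is at best a lower bound on MIC. The normalization by log min(x, y) and the grid-size budget n^0.6 follow the full paper's formal definition, which lies outside this excerpt, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def equal_frequency_bins(values: Sequence[float], k: int) -> List[int]:
    """Label each value with one of k bins holding nearly equal numbers of points.

    Tied values are split by sort order, which is a simplification.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * k // n
    return labels


def mutual_information(a: Sequence[int], b: Sequence[int]) -> float:
    """Mutual information (in nats) of the empirical joint distribution of two labelings."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a = Counter(a)
    count_b = Counter(b)
    return sum(
        (c / n) * math.log(c * n / (count_a[i] * count_b[j]))
        for (i, j), c in joint.items()
    )


def approx_mic(xs: Sequence[float], ys: Sequence[float], exponent: float = 0.6) -> float:
    """Best normalized mutual information over equal-frequency grids only.

    Grids are limited to cols * rows <= n ** exponent; with very small n no
    grid fits the budget and the function returns 0.0.
    """
    n = len(xs)
    budget = n ** exponent
    best = 0.0
    for cols in range(2, int(budget // 2) + 1):
        x_bins = equal_frequency_bins(xs, cols)
        for rows in range(2, int(budget // cols) + 1):
            y_bins = equal_frequency_bins(ys, rows)
            info = mutual_information(x_bins, y_bins)
            # Dividing by log(min(cols, rows)) maps the score into [0, 1].
            best = max(best, info / math.log(min(cols, rows)))
    return best


if __name__ == "__main__":
    xs = [i / 199 for i in range(200)]
    print("line:    ", approx_mic(xs, xs))
    print("parabola:", approx_mic(xs, [(x - 0.5) ** 2 for x in xs]))
```

On noiseless data both the line and the parabola reach a score of essentially 1 here, because a 2-by-2 grid captures the line and a 4-by-2 grid captures the parabola. A faithful implementation would also optimize where the grid lines are placed, which this sketch does not attempt.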

More formally, for a grid G, let $I_G$ denote the mutual information of the probability dis-

¹Department of Computer Science, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA.
²Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
³Department of Statistics, University of Oxford, Oxford OX1 3TG, UK.
⁴Department of Mathematics, Harvard College, Cambridge, MA 02138, USA.
⁵Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel.
⁶Center for Systems Biology, Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.
⁷Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK.
⁸Department of Biology, MIT, Cambridge, MA 02139, USA.
⁹Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA.
¹⁰School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA.

*These authors contributed equally to this work.
†To whom correspondence should be addressed. E-mail: dnreshef@mit.edu (D.N.R.); yreshef@post.harvard.edu (Y.A.R.)
‡These authors contributed equally to this work.

[Figure 1: panels A to C. Recoverable labels: Columns, Rows, Vertical Axis Bins, Horizontal Axis Bins, Normalized Score.]

Fig. 1. Computing MIC. (A) For each pair (x,y), the MIC algorithm finds the x-by-y grid with the highest induced mutual information. (B) The algorithm normalizes the mutual information scores and compiles a matrix that stores, for each resolution, the best grid at that resolution and its normalized score. (C) The normalized scores form the characteristic matrix, which can be visualized as a surface; MIC corresponds to the highest point on this surface. In this example, there are many grids that achieve the highest score. The star in (B) marks a sample grid achieving this score, and the star in (C) marks that grid’s corresponding location on the surface.
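In symbols, the quantities pictured in Fig. 1 can be written as follows, with $I_G$ the mutual information induced by a grid $G$ and $n$ the sample size. The normalizing factor $\log \min\{x, y\}$ and the resolution bound $B(n) = n^{0.6}$ are taken from the full paper's formal definition, which is not part of this excerpt, so they should be read as assumptions here.

```latex
m_{x,y} \;=\; \frac{\max_{G \,\in\, \{x\text{-by-}y \text{ grids}\}} I_G}{\log \min\{x, y\}},
\qquad
\mathrm{MIC} \;=\; \max_{x y \,<\, B(n)} m_{x,y},
\qquad
B(n) \;=\; n^{0.6}.
```

The division by $\log \min\{x, y\}$ is what allows scores from grids of different dimensions to be compared on a common scale from 0 to 1.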

RESEARCH ARTICLES. Science, Vol. 334, 16 December 2011, p. 1518. www.sciencemag.org

