Adaptive Grids for Clustering Massive Data Sets∗
Harsha Nagesh†, Sanjay Goil‡, and Alok Choudhary§
Clustering is a key data mining problem. Density- and grid-based techniques are
a popular way to mine clusters in a large multi-dimensional space, wherein clusters
are regarded as regions that are denser than their surroundings. The attribute values
and ranges of these attributes characterize the clusters. Fine grid sizes lead to a huge
amount of computation, while coarse grid sizes result in a loss in the quality of the
clusters found. Moreover, different grid sizes result in discovering clusters with
different cluster descriptions. The technique of adaptive grids determines the grids
from the data distribution and does not require the user to specify parameters such
as the grid size or density thresholds. Further, clusters may be embedded in a subspace
of a high-dimensional space. We propose a modified bottom-up subspace clustering
algorithm to discover clusters in all possible subspaces. Our method scales linearly
with the data dimensionality and the size of the data set. Experimental results
on a wide variety of synthetic and real data sets demonstrate the effectiveness of
adaptive grids and the effect of the modified subspace clustering algorithm. Our
algorithm explores at least an order of magnitude more subspaces than the original
algorithm, and the use of adaptive grids yields on average a two orders of magnitude
speedup compared to the method with user-specified grid size and threshold.
∗This work was supported in part by NSF Young Investigator Award CCR-9357840, NSF CCR-9509143, and the Department of Energy ASCI/ASAP program.
†Mr. Harsha Nagesh, Bell Labs Research, Murray Hill, NJ, harsha@research.bell-labs.com
‡Dr. Sanjay Goil, Sun Microsystems Inc., sanjay.goil@eng.sun.com
§Prof. Alok Choudhary, Department of ECE, Northwestern University, choudhar@ece.nwu.edu
1 Introduction
The ability to identify interesting patterns in large-scale consumer data empowers
business establishments to leverage the information obtained. Clustering is a data
mining technique that finds such previously unknown patterns in large-scale data,
embedded in a large multi-dimensional space. Clustering techniques find application
in several fields: clustering web documents based on web logs has been studied in [1],
customer segmentation based on similarity of buying interests is explored as
collaborative filtering in [2], and [3] explores the detection of clusters in geographic
information systems. Clustering algorithms need to address several issues. Scalability
of these algorithms with the database size is as important as their scalability with
the dimensionality of the data sets. Effective representation of the detected clusters
is as important as cluster detection itself, and simplicity in reporting the clusters
found improves the usability of the algorithm. Further, most clustering algorithms
require key input parameters from the user which are hard to determine and which
also greatly control the process of cluster detection. Clusters could also be embedded
in a subspace of the total data space. Detecting clusters in all possible subspaces
leads to an exponential algorithm, as the number of subspaces is exponential in the
data dimension. Most of the earlier work in statistics and data mining [4, 5] operates
on and finds clusters in the whole data space. Many clustering algorithms [3, 6, 5]
require user input of several parameters, such as the number of clusters or the average
dimensionality of the clusters, which are not only difficult to determine but are also
impractical for real-world data sets.
We use a grid- and density-based approach for cluster detection in subspaces.
Density-based approaches regard clusters as regions of higher density than their
surroundings. In a grid- and density-based approach, the multi-dimensional space is
divided into a large number of hyper-rectangular regions, and regions which have more
points than a specified threshold are identified as dense. Finally, dense
hyper-rectangular regions that are adjacent to each other are merged to find the
embedded clusters. The quality of the results and the computation requirements depend
heavily on the number of bins in each dimension. Hence, determining bin sizes
automatically based on the data distribution greatly helps in finding correct clusters
of high quality and reduces the computation substantially.
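To make this scheme concrete, here is a minimal Python sketch of uniform-grid density clustering on two-dimensional data (the fixed-grid baseline, not the adaptive method proposed in this paper): points are histogrammed into a grid, cells above a point-count threshold are marked dense, and face-adjacent dense cells are merged into clusters by a flood fill. The parameter values are illustrative.

```python
import numpy as np
from collections import deque

def grid_clusters(points, bins=10, threshold=5):
    """Uniform-grid density clustering on 2-D data (illustrative sketch).

    points    : (N, 2) array of data points
    bins      : number of equal-width intervals per dimension
    threshold : minimum point count for a cell to be considered dense
    """
    # Build the grid histogram: counts[i, j] is the number of points
    # falling in cell (i, j).
    counts, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1],
                                            bins=bins)
    dense = counts > threshold

    # Merge face-adjacent dense cells into clusters with a BFS flood fill;
    # labels[i, j] holds the cluster id of cell (i, j), or -1 if not dense.
    labels = np.full(counts.shape, -1, dtype=int)
    next_label = 0
    for i in range(bins):
        for j in range(bins):
            if dense[i, j] and labels[i, j] < 0:
                queue = deque([(i, j)])
                labels[i, j] = next_label
                while queue:
                    a, b = queue.popleft()
                    for na, nb in ((a+1, b), (a-1, b), (a, b+1), (a, b-1)):
                        if (0 <= na < bins and 0 <= nb < bins
                                and dense[na, nb] and labels[na, nb] < 0):
                            labels[na, nb] = next_label
                            queue.append((na, nb))
                next_label += 1
    return labels, (xedges, yedges)
```

In high dimensions the same idea applies, but enumerating all cells becomes infeasible, which motivates the bottom-up subspace approach described in Section 3.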
1.1 Contributions
In this paper we present MAFIA (Merging of Adaptive Finite Intervals), a scalable
subspace clustering algorithm that adaptively computes the finite intervals (bins)
in each dimension; these bins are merged to explore clusters in higher dimensions.
Adaptive grid sizes improve clustering quality by concentrating on the portions of
the data space that have more points and are thus more likely to be part of a cluster
region, and they enable minimal-length DNF expressions, which are important for
interpretation of the results by the end user. MAFIA does not require any key user
inputs, taking only the strength of the clusters that need to be discovered in the
given data set. Further, we present a modified bottom-up algorithm for cluster
detection in all possible subspaces previously unexplored by other subspace
clustering algorithms. We describe recent work on
clustering techniques in databases in Section 2. Density and grid based clustering
is presented in Section 3, where we describe subspace clustering as introduced by
[7] and also describe our approach of using adaptive grids. Section 4 presents the
modified subspace clustering algorithm with a theoretical analysis. Finally, Section
5 presents the performance evaluation on a wide variety of synthetic and real data
sets with a large number of dimensions, highlighting both the scalability of the algorithms
and the quality of the clustering. Section 6 concludes the paper.
2 Related Work
Clustering algorithms have long been studied in statistics [8], machine learning
[9], pattern recognition, image processing [10], and databases. k-means, k-medoids,
CLARANS [11], BIRCH [5], and CURE [12] are some of the earlier works. However,
none of these algorithms detect clusters in subspaces. PROCLUS [6], a subspace
clustering algorithm, finds representative cluster centers in an appropriate set of
cluster dimensions. It needs the number of clusters, k, and the average cluster
dimensionality, l, as input parameters, neither of which can be known a priori
for real data sets. Density- and grid-based approaches regard clusters as regions
of the data space in which objects are dense and which are separated by regions of
low object density [13]. The grid size determines the computation and the quality of
the clustering. CLIQUE, a density- and grid-based approach for high-dimensional
data sets [7], detects clusters in the highest dimensional subspaces, taking the size
of the grid and a global density threshold for clusters as inputs. The computational
complexity and the quality of the clustering are heavily dependent on these parameters.
ENCLUS [14], an entropy-based subspace clustering algorithm, requires a prohibitive
amount of time just to discover the interesting subspaces in which clusters are embedded,
and it requires entropy thresholds as inputs, which are not intuitive for the user.
3 Density and Grid based Clustering
Density-based approaches regard clusters as regions of higher density than their
surroundings. A common way of finding high-density regions in the data space is
based on the grid cell densities [13]. A histogram is constructed by partitioning the
data space into a number of non-overlapping regions and then mapping the data
points to each cell in the grid. Equal length intervals are used in [7] to partition
each dimension, which results in uniform volume cells. The number of points inside
the cell with respect to the volume of the cell can be used to determine the density
of the cell. Clusters are unions of connected high density cells. Two k-dimensional
cells are connected if they have a common face in the k-dimensional space or if
they are connected by a common cell. Creating a histogram that counts the points
contained in each unit is infeasible in high dimensional data as the number of hyper-
rectangles is exponential in the dimensionality of the data set. Subspace clustering
further complicates the problem as it results in an explosion of such units. One
needs to create histograms in hyper-rectangles formed in all possible subspaces. A
bottom-up approach of finding dense units and merging them to find dense clusters
in higher dimensional subspaces has been proposed in CLIQUE [7]. Each dimension
is divided into a user-specified number of intervals. The algorithm starts by deter-
mining 1-dimensional dense units by making a pass over the data. In [7] candidate
dense cells in any k dimensions are obtained by merging the dense cells in (k − 1)
dimensions which share the first (k − 2) dimensions. A pass over the data is made to
find which of the candidate dense cells are actually dense. The algorithm terminates
when no more candidate dense cells are generated. In [7] candidate dense units are
pruned based on a minimum description length technique to find the dense units
only in interesting subspaces. However, as noted in [7] this could result in missing
some dense units in the pruned subspaces. In order to maintain the high quality of
clustering we do not use this pruning technique.
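To illustrate this bottom-up step, the sketch below generates k-dimensional candidate dense units from the (k − 1)-dimensional dense units that share their first (k − 2) dimensions, in the style of [7]. The representation of a dense unit as a tuple of (dimension index, bin index) pairs sorted by dimension is an assumption of this sketch, not a detail fixed by the paper.

```python
def candidate_dense_units(dense_units):
    """Join (k-1)-dimensional dense units that share their first (k-2)
    dimensions into k-dimensional candidate dense units (sketch).

    dense_units : set of units, each a tuple of (dimension index, bin
                  index) pairs sorted by dimension index
    """
    candidates = set()
    units = sorted(dense_units)
    for u in units:
        for v in units:
            # The units must agree on their first (k-2) dimensions and
            # differ in the last one; requiring u's last dimension index
            # to be smaller avoids generating each candidate twice.
            if u[:-1] == v[:-1] and u[-1][0] < v[-1][0]:
                candidates.add(u + (v[-1],))
    return candidates

# Two 2-dimensional dense units sharing (dimension 0, bin 3) yield the
# 3-dimensional candidate ((0, 3), (1, 7), (2, 2)).
dense_2d = {((0, 3), (1, 7)), ((0, 3), (2, 2))}
print(candidate_dense_units(dense_2d))
```

A pass over the data then determines which of these candidates are actually dense, exactly as in the loop described above.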
3.1 Adaptive Grids
We propose adaptive interval sizes, in which the bins are determined based on the
data distribution in each dimension. The size of the bins, and hence the number of
bins in each dimension, in turn determines the computation and the quality of the
clustering. Finer grids lead to an explosion in the number of candidate dense units
(CDUs), while coarser grids lead to fewer bins, and regions containing only noise
might then be propagated as dense cells. Also, a user-defined uniform grid size may
fail to detect many clusters or may yield very poor quality results. A single pass
over the data is made in order to construct a histogram in every dimension.
Algorithm 3.1 describes the steps of the adaptive grid technique. The domain of each
dimension is divided into fine intervals, each of size x. The size of each bin, x, is
selected such that each dimension has a minimum of 1000 fine bins: if the range of the
dimension is from m to n, we set the number of fine bins in that dimension to
max(1000, n − m) and correspondingly find the value of x. We assume that all attributes
have been normalized to the same base quantity when finding m, n, and hence x. The
maximum of the histogram values within a window is taken as the window value.
Adjacent windows whose values differ by less than a threshold percentage, β, are
merged to form larger windows, ensuring that we divide the dimension into
variable-sized bins which capture the data distribution. In dimensions where the data
is uniformly distributed, this results in a single bin, which indicates that a cluster
is much less likely to be found there. In that case we split the domain into a small
fixed number of partitions, set a high threshold (as this dimension is less likely to
be part of a cluster), and collect statistics for these bins for further examination.
This technique greatly reduces the computation time by limiting the number of bins
contributed by non-cluster dimensions. Variable-sized bins are assigned a variable
threshold: a bin is likely to be part of a cluster if it has a significantly greater
number of points (by a factor we call the cluster dominance factor, α [15]) than it
would have had if the data were uniformly distributed in that dimension. Thus, for a
bin of size a in a dimension of size D_i, we set its threshold to αaN/D_i, where N is
the total number of data points.
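As an illustrative numerical example (the values are ours, not from the experiments), consider a dimension of size D_i = 100 containing N = 100,000 points, with α = 1.5. A bin of size a = 5 would hold aN/D_i = 5,000 points if the data were uniform in that dimension, so its threshold is set to αaN/D_i = 7,500 points.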
Effect of Adaptive Grids on Computation
Figure 1 shows the histogram of the data in two dimensions. Figure 1(a) shows the
user-defined uniform-sized grid used in the approach of [7], resulting in a large
number of CDUs (rectangles in the figure).
Algorithm 3.1 Adaptive Grid Computation
D_i - Domain of attribute A_i
N - Total number of data points in the data set
a - Size of a generic bin
for each dimension A_i, i ∈ (1, ..., d)
    Divide D_i into windows of some small size x
    Compute the histogram for each unit of A_i, and set the value of
        the window to the maximum in the window
    From left to right, merge two adjacent units if they are within a threshold β
    /* Single bin implies an equi-distributed dimension */
    if (number of bins == 1)
        Divide the dimension A_i into a fixed number of equal partitions
    Compute the threshold of each bin of size a as αaN/D_i
end
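The following Python sketch renders this procedure for a single dimension under simplifying assumptions of ours: each fine interval is treated as its own window, and the merge threshold β, the fallback partition count, and α are illustrative parameters.

```python
import numpy as np

def adaptive_bins(values, beta=0.2, fallback_parts=5, alpha=1.5):
    """One-dimensional adaptive bin computation (sketch of Algorithm 3.1).

    values : 1-D array of attribute values for this dimension
    beta   : merge adjacent windows whose counts are within this fraction
    Returns a list of (left edge, right edge, density threshold) triples.
    """
    lo, hi = float(values.min()), float(values.max())
    n_fine = max(1000, int(hi - lo))          # at least 1000 fine windows
    counts, edges = np.histogram(values, bins=n_fine, range=(lo, hi))

    # Merge adjacent windows from left to right while their values are
    # within the threshold percentage beta; track [start, end, max count].
    bins = [[0, 1, int(counts[0])]]
    for w in range(1, n_fine):
        start, end, peak = bins[-1]
        if abs(int(counts[w]) - peak) <= beta * max(peak, 1):
            bins[-1] = [start, w + 1, max(peak, int(counts[w]))]
        else:
            bins.append([w, w + 1, int(counts[w])])

    # A single bin implies a near-uniform dimension: fall back to a small
    # fixed number of equal partitions (given a high threshold in practice).
    if len(bins) == 1:
        step = n_fine // fallback_parts
        bins = [[i * step,
                 (i + 1) * step if i < fallback_parts - 1 else n_fine, 0]
                for i in range(fallback_parts)]

    # Threshold for a bin of size a in a dimension of size D_i: alpha*a*N/D_i.
    N, D = len(values), hi - lo
    width = D / n_fine
    return [(edges[s], edges[e], alpha * (e - s) * width * N / D)
            for s, e, _ in bins]
```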
Figure 1. (a) Uniform grid size (b) Adaptive grid size
With an increase in the data set size, each pass over the data in the bottom-up
algorithm results in evaluating a large number of candidate dense units. However,
as seen in Figure 1(b), adaptive grids make use of the data distribution and compute
the minimum number of bins required in each dimension, resulting in very few CDUs.
Effect of Adaptive Grids on Quality of Clustering
Figure 2 shows a 'plus'-shaped cluster (abcdefghijkl) as discovered by [7] and by
our algorithm using adaptive grids. The cluster reported by CLIQUE, pqrs, is shown
in Figure 2(a): parts of the cluster discovered by [7] are not in the originally
defined cluster abcdefghijkl, and parts of the original cluster are thrown away as
outliers. MAFIA uses adaptive grid boundaries, so the cluster definitions are minimal
DNF expressions that represent the clusters accurately. MAFIA develops grid
boundaries very close to the boundaries of the cluster and reports abcdefghijkl,
shown in Figure 2(b), as the cluster with the DNF expression
(l, y) ∨ (m, z) ∨ (n, y) ∨ (m, x) ∨ (m, y).
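As a small illustration of such descriptions, the sketch below (our construction; the cell names are from Figure 2, where a pair such as (l, y) abbreviates the conjunction l ∧ y) renders a cluster given as a union of grid cells as a DNF expression, with one conjunct per cell.

```python
def dnf_description(cells):
    """Render a cluster, given as a union of grid cells, as a DNF
    expression: a disjunction of per-cell conjunctions (sketch)."""
    return " ∨ ".join("(" + " ∧ ".join(cell) + ")" for cell in cells)

# The 'plus'-shaped cluster of Figure 2(b) as a union of five cells.
plus = [("l", "y"), ("m", "z"), ("n", "y"), ("m", "x"), ("m", "y")]
print(dnf_description(plus))
# (l ∧ y) ∨ (m ∧ z) ∨ (n ∧ y) ∨ (m ∧ x) ∨ (m ∧ y)
```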
4 MAFIA Implementation
MAFIA consists mainly of the following steps. The algorithm starts by making a
pass over the data in chunks of B records to enable scaling to out-of-core data
sets. A histogram of the data is built in each dimension, as elaborated in Algorithm
3.1. The adaptive finite intervals in every dimension are determined and the bin sizes