The Design and Implementation of Open vSwitch


The datapath kernel module is usually written specially for the host operating system, for performance.

Figure 1: The components and interfaces of Open vSwitch. The first packet of a flow results in a miss, and the kernel module directs the packet to the userspace component, which caches the forwarding decision for subsequent packets into the kernel.

Figure 1 depicts how the two main OVS components work together to forward packets. The datapath module in the kernel receives the packets first, from a physical NIC or a VM's virtual NIC. Either ovs-vswitchd has instructed the datapath how to handle packets of this type, or it has not. In the former case, the datapath module simply follows the instructions, called actions, given by ovs-vswitchd, which list physical ports or tunnels on which to transmit the packet. Actions may also specify packet modifications, packet sampling, or instructions to drop the packet. In the other case, where the datapath has not been told what to do with the packet, it delivers it to ovs-vswitchd. In userspace, ovs-vswitchd determines how the packet should be handled, then it passes the packet back to the datapath with the desired handling. Usually, ovs-vswitchd also tells the datapath to cache the actions, for handling similar future packets.

In Open vSwitch, flow caching has greatly evolved over time: the initial datapath was a microflow cache, essentially caching per-transport-connection forwarding decisions. In later versions, the datapath has two layers of caching: a microflow cache and a secondary layer, called a megaflow cache, which caches forwarding decisions for traffic aggregates beyond individual connections. We will return to the topic of caching in more detail in Section 4.

Open vSwitch is commonly used as an SDN switch, and the main way to control forwarding is OpenFlow [27]. Through a simple binary protocol, OpenFlow allows a controller to add, remove, update, monitor, and obtain statistics on flow tables and their flows, as well as to divert selected packets to the controller and to inject packets from the controller into the switch. In Open vSwitch, ovs-vswitchd receives OpenFlow flow tables from an SDN controller, matches any packets received from the datapath module against these OpenFlow tables, gathers the actions applied, and finally caches the result in the kernel datapath. This allows the datapath module to remain unaware of the particulars of the OpenFlow wire protocol, further simplifying it. From the OpenFlow controller's point of view, the caching and separation into user and kernel components are invisible implementation details: in the controller's view, each packet visits a series of OpenFlow flow tables, and the switch finds the highest-priority flow whose conditions are satisfied by the packet and executes its OpenFlow actions.

The flow programming model of Open vSwitch largely determines the use cases it can support, and to this end Open vSwitch has many extensions to standard OpenFlow to accommodate network virtualization. We will discuss these extensions shortly, but before that, we turn our focus to the performance-critical aspects of this design: packet classification and the kernel-userspace interface.

3.2 Packet Classification

Algorithmic packet classification is expensive on general-purpose processors, and packet classification in the context of OpenFlow is especially costly because of the generality of the form of the match, which may test any combination of Ethernet addresses, IPv4 and IPv6 addresses, TCP and UDP ports, and many other fields, including packet metadata such as the switch ingress port.

Open vSwitch uses a tuple space search classifier [34] for all of its packet classification, both kernel and userspace. To understand how tuple space search works, assume that all the flows in an Open vSwitch flow table matched on the same fields in the same way, e.g., all flows match the source and destination Ethernet address but no other fields. A tuple search classifier implements such a flow table as a single hash table. If the controller then adds new flows with a different form of match, the classifier creates a second hash table that hashes on the fields matched in those flows. (The tuple of a hash table in a tuple space search classifier is, properly, the set of fields that form that hash table's key, but we often refer to the hash table itself as the tuple, as a kind of useful shorthand.) With two hash tables, a search must look in both hash tables. If there are no matches, the flow table doesn't contain a match; if there is a match in one hash table, that flow is the result; if there is a match in both, then the result is the flow with the higher priority. As the controller continues to add more flows with new forms of match, the classifier similarly expands to include a hash table for each unique match, and a search of the classifier must look in every hash table.

While the lookup complexity of tuple space search is far from the state of the art [8, 18, 38], it performs well with the flow tables we see in practice, and it has three attractive properties over decision tree classification algorithms. First, it supports efficient constant-time updates (an update translates to a single hash table operation), which makes it suitable for use with virtualized environments where a centralized controller may add and remove flows often, sometimes multiple times per second per hypervisor, in response to changes in the whole datacenter. Second, tuple space search generalizes to an arbitrary number of packet header fields, without any algorithmic change. Finally, tuple space search uses memory linear in the number of flows.

The relative cost of a packet classification is further amplified by the large number of flow tables that sophisticated SDN controllers use. For example, flow tables installed by the VMware network virtualization controller [19] use a minimum of about 15 table lookups per packet in its packet processing pipeline. Long pipelines are driven by two factors: reducing stages through cross-producting would often significantly increase the flow table sizes, and developers prefer to modularize the pipeline design. Thus, even more important than the performance of a single classifier lookup is reducing the number of flow table lookups a single packet requires, on average.

3.3 OpenFlow as a Programming Model

Initially, Open vSwitch focused on a reactive flow programming model in which a controller responding to traffic installs microflows which match every supported OpenFlow field. This approach is easy to support for software switches and controllers alike, and early research suggested it was sufficient [3]. However, reactive programming of microflows soon proved impractical for use outside of small deployments, and Open vSwitch had to adapt to proactive flow programming to limit its performance costs.

In OpenFlow 1.0, a microflow has about 275 bits of information, so that a flow table for every microflow would have 2^275 or more entries. Thus, proactive population of flow tables requires support for wildcard matching to cover the header space of all possible packets. With a single table this results in a "cross-product problem": to vary the treatment of packets according to n1 values of field A and n2 values of field B, one must install n1 x n2 flows in the general case, even if the actions to be taken based on A and B are independent. Open vSwitch soon introduced an extension action called resubmit that allows packets to consult multiple flow tables (or the same table multiple times), aggregating the resulting actions. This solves the cross-product problem, since one table can contain n1 flows that consult A and another table n2 flows that consult B. The resubmit action also enables a form of programming based on multiway branching on the value of one or more fields. Later, OpenFlow vendors focusing on hardware sought a way to make better use of the multiple tables consulted in series by forwarding ASICs, and OpenFlow 1.1 introduced multi-table support. Open vSwitch adopted the new model but retained its support for the resubmit action, for backward compatibility and because the new model did not allow for recursion, but only forward progress through a fixed table pipeline.

At this point, a controller could implement programs in Open vSwitch flow tables that could make decisions based on packet headers using arbitrary chains of logic, but they had no access to temporary storage. To solve that problem, Open vSwitch extended OpenFlow in another way, by adding meta-data fields called "registers" that flow tables could match, plus additional actions to modify and copy them around. With this, for instance, flows could decide a physical destination early in the pipeline, then run the packet through packet processing steps identical regardless of the chosen destination, until sending the packet, possibly using destination-specific instructions. As another example, VMware's NVP network virtualization controller [19] uses registers to keep track of a packet's progress through a logical L2 and L3 topology implemented as "logical datapaths" that it overlays on the physical OpenFlow pipeline.

OpenFlow is specialized for flow-based control of a switch. It cannot create or destroy OpenFlow switches, add or remove ports, configure QoS queues, associate OpenFlow controllers and switches, enable or disable STP (Spanning Tree Protocol), etc. In Open vSwitch, this functionality is controlled through a separate component, the configuration database. To access the configuration database, an SDN controller may connect to ovsdb-server over the OVSDB protocol [28], as shown in Figure 1. In general, in Open vSwitch, OpenFlow controls potentially fast-changing and ephemeral data such as the flow table, whereas the configuration database contains more durable state.
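Before turning to caching, the tuple space search scheme summarized in Section 3.2 can be made concrete with a short sketch. This is an illustrative Python sketch, not OVS code; the class name, field names, and action strings are invented for the example:

```python
# Illustrative sketch of a tuple space search classifier. Each
# "tuple" is one hash table keyed on a distinct set of matched
# fields; a lookup probes every tuple and keeps the matching flow
# with the highest priority.

class TupleSpaceClassifier:
    def __init__(self):
        # Maps a frozenset of field names (the "tuple") to a hash
        # table from projected field values to (priority, action).
        self.tuples = {}

    def add_flow(self, match, priority, action):
        """match: dict mapping field name -> required value."""
        fields = frozenset(match)
        table = self.tuples.setdefault(fields, {})
        key = tuple(sorted(match.items()))
        table[key] = (priority, action)

    def lookup(self, packet):
        """packet: dict mapping field name -> value. Returns the
        action of the highest-priority matching flow, or None."""
        best = None
        for fields, table in self.tuples.items():
            # Project the packet onto this tuple's fields and probe
            # its hash table; every tuple must be searched, because
            # any of them might hold a higher-priority match.
            key = tuple(sorted((f, packet.get(f)) for f in fields))
            hit = table.get(key)
            if hit is not None and (best is None or hit[0] > best[0]):
                best = hit
        return best[1] if best is not None else None
```

Adding a flow with a new form of match creates a new hash table, and every subsequent lookup probes it too, which is exactly the growth in lookup cost that the caching machinery of the following sections works to contain.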
4 Flow Cache Design

This section describes the design of flow caching in Open vSwitch and how it evolved to its current state.

4.1 Microflow Caching

In 2007, when the development of the code that would become Open vSwitch started on Linux, only in-kernel packet forwarding could realistically achieve good performance, so the initial implementation put all OpenFlow processing into a kernel module. The module received a packet from a NIC or VM, classified it through the OpenFlow table (with standard OpenFlow matches and actions), modified it as necessary, and finally sent it to another port. This approach soon became impractical because of the relative difficulty of developing in the kernel and distributing and updating kernel modules. It also became clear that an in-kernel OpenFlow implementation would not be acceptable as a contribution to upstream Linux, which is an important requirement for mainstream acceptance for software with kernel components.

Our solution was to reimplement the kernel module as a microflow cache, in which a single cache entry exact-matches all the packet header fields supported by OpenFlow. This allowed radical simplification, by implementing the kernel module as a simple hash table rather than as a complicated, generic packet classifier supporting arbitrary fields and masking. In this design, cache entries are extremely fine-grained and match at most packets of a single transport connection: even for a single transport connection, a change in network path, and hence in the IP TTL field, would result in a miss and would divert a packet to userspace, which consulted the actual OpenFlow flow table to decide how to forward it. This implies that the critical performance dimension is flow setup time, the time that it takes for the kernel to report a microflow "miss" to userspace and for userspace to reply.

Over multiple Open vSwitch versions, we adopted several techniques to reduce flow setup time with the microflow cache. Batching flow setups that arrive together improved flow setup performance about 24%, for example, by reducing the average number of system calls required to set up a given microflow. Eventually, we also distributed flow setup load over multiple userspace threads to benefit from multiple CPU cores. Drawing inspiration from CuckooSwitch [42], we adopted optimistic concurrent cuckoo hashing [6] and RCU [23] techniques to implement nonblocking multiple-reader, single-writer flow tables.

After general optimizations of this kind, customer feedback drew us to focus on performance in latency-sensitive applications, and that required us to reconsider our simple caching design.

4.2 Megaflow Caching

While the microflow cache works well with most traffic patterns, it suffers serious performance degradation when faced with large numbers of short-lived connections. In this case, many packets miss the cache and must not only cross the kernel-userspace boundary, but also execute a long series of expensive packet classifications. While batching and multithreading can somewhat alleviate this stress, they are not sufficient to fully support this workload.

We replaced the microflow cache with a megaflow cache. The megaflow cache is a single flow lookup table that supports generic matching, i.e., it supports caching forwarding decisions for larger aggregates of traffic than connections. While it more closely resembles a generic OpenFlow table than the microflow cache does, due to its support for arbitrary packet field matching, it is still strictly simpler and lighter in runtime for two primary reasons. First, it does not have priorities, which speeds up packet classification: the in-kernel tuple space search implementation can terminate as soon as it finds any match, instead of continuing to look for a higher-priority match until all the mask-specific hash tables are inspected. (To avoid ambiguity, userspace installs only disjoint megaflows, those whose matches do not overlap.) Second, there is only one megaflow classifier, instead of a pipeline of them, so userspace installs megaflow entries that collapse together the behavior of all relevant OpenFlow tables.

The cost of a megaflow lookup is close to the general-purpose packet classifier, even though it lacks support for flow priorities. Searching the megaflow classifier requires searching each of its hash tables until a match is found; and as discussed in Section 3.2, each unique kind of match in a flow table yields a hash table in the classifier. Assuming that each hash table is equally likely to contain a match, matching packets require searching (n + 1)/2 tables on average, and non-matching packets require searching all n. Therefore, for n > 1, which is usually the case, a classifier-based megaflow search requires more hash table lookups than a microflow cache. Megaflows by themselves thus yield a trade-off: one must bet that the per-microflow benefit of avoiding an extra trip to userspace outweighs the per-packet cost of the extra hash lookups in the form of megaflow lookup.

Open vSwitch addresses the costs of megaflows by retaining the microflow cache as a first-level cache, consulted before the megaflow cache. This cache is a hash table that maps from a microflow to its matching megaflow. Thus, after the first packet in a microflow passes through the kernel megaflow table, requiring a search of the kernel classifier, this exact-match cache allows subsequent packets in the same microflow to be quickly directed to the appropriate megaflow. This reduces the cost of megaflows from per-packet to per-microflow. The exact-match cache is a true cache in that its activity is not visible to userspace, other than through its effects on performance.

A megaflow flow table represents an active subset of the cross-product of all the userspace OpenFlow flow tables. To avoid the cost of proactive cross-product computation, and to populate the megaflow cache only with entries relevant for currently forwarded traffic, the Open vSwitch userspace daemon computes the cache entries incrementally and reactively. As Open vSwitch processes a packet through userspace flow tables, classifying the packet at every table, it tracks the packet field bits that were consulted as part of the classification algorithm. The generated megaflow must match any field (or part of a field) whose value was used as part of the decision. For example, if the classifier looks at the IP destination field in any OpenFlow table as part of its pipeline, then the megaflow cache entry's condition must match on the destination IP as well. This means that incoming packets drive the cache population, and as the aggregates of the traffic evolve, new entries are populated and old entries removed.

The foregoing discussion glosses over some details. The basic algorithm, while correct, produces match conditions that are more specific than necessary, which translates to suboptimal cache hit rates. Section 5, below, describes how Open vSwitch modifies tuple space search to yield better megaflows for caching. Afterward, Section 6 addresses cache invalidation.

5 Caching-aware Packet Classification

We now turn our focus to the refinements and improvements we made to the basic tuple search algorithm (summarized in Section 3.2) to improve its suitability for flow caching.

5.1 Problem

As Open vSwitch userspace processes a packet through its OpenFlow tables, it tracks the packet field bits that were consulted as part of the forwarding decision. This bitwise tracking of packet header fields is very effective in constructing the megaflow entries with simple OpenFlow flow tables.

For example, if the OpenFlow table only looks at Ethernet addresses (as would a flow table based on L2 MAC learning), then the megaflows it generates will also look only at Ethernet addresses. For example, port scans (which do not vary Ethernet addresses) will not cause packets to go to userspace, as their L3 and L4 header fields will be wildcarded, resulting in near-ideal megaflow cache hit rates. On the other hand, if even one flow entry in the table matches on the TCP destination port, tuple space search will consider the TCP destination port of every packet. Then every megaflow will also match on the TCP destination port, and port scan performance again drops.

We do not know of an efficient online algorithm to generate optimal, least-specific megaflows, so in development we have focused our attention on generating increasingly good approximations. Failing to match a field that must be included can cause incorrect packet forwarding, which makes such errors unacceptable, so our approximations are biased toward matching on more fields than necessary. The following sections describe improvements of this type that we have integrated into Open vSwitch.

5.2 Tuple Priority Sorting

Lookup in a tuple space search classifier ordinarily requires searching every tuple. Even if a search of an early tuple finds a match, the search must still look in the other tuples, because one of them might contain a matching flow with a higher priority.

We improved on this by tracking, in each tuple T, the maximum priority T.pri_max of any flow entry in T. We modified the lookup code to search tuples from greatest to least maximum priority, so that a search that finds a matching flow F with priority F.pri can terminate as soon as it arrives at a tuple whose maximum priority is F.pri or less, since at that point no better match can be found. Figure 2 shows the algorithm in detail.

    function PRIORITYSORTEDTUPLESEARCH(H)
        B <- NULL                      /* Best flow match so far. */
        for tuple T in descending order of T.pri_max do
            if B != NULL and B.pri >= T.pri_max then
                return B
            if T contains a flow F matching H then
                if B = NULL or F.pri > B.pri then
                    B <- F
        return B

Figure 2: Tuple space search for target packet headers H, with priority sorting.

As an example, we examined the OpenFlow table installed by a production deployment of VMware's NVP controller [19]. This table contained 29 tuples. Of those 29 tuples, 26 contained flows of a single priority, which makes intuitive sense because flows matching a single tuple tend to share a purpose and therefore a priority. When searching in descending priority order, one can always terminate immediately following a successful match in such a tuple. Considering the other tuples, two contained flows with two unique priorities that were higher than those in any subsequent tuple, so any match in either of these tuples terminated the search. The final tuple contained flows with five unique priorities ranging from 32767 to 36866; in the worst case, if the lowest-priority flows matched in this tuple, then the remaining tuples with T.pri_max > 32767 (up to 20 tuples based on this tuple's location in the sorted list) must also be searched.

5.3 Staged Lookup

Tuple space search searches each tuple with a hash table lookup. In our algorithm to construct the megaflow matching condition, this hash table lookup means that the megaflow must match all the bits of fields included in the tuple, even if the tuple search fails, because every one of those fields and their bits may have affected the lookup result so far. When the tuple matches on a field that varies often from flow to flow, e.g., the TCP source port, the generated megaflow is not much more useful than installing a microflow would be, because it will only match a single TCP stream.

This points to an opportunity for improvement. If one could search a tuple on a subset of its fields, and determine with this search that the tuple could not possibly match, then the generated megaflow would only need to match on the subset of fields, rather than all the fields in the tuple.

The tuple implementation as a hash table over all its fields made such an optimization difficult. One cannot search a hash table on a subset of its key. We considered other data structures. A trie would allow a search on any prefix of fields, but it would also increase the number of memory accesses required by a successful search from O(1) to O(n) in the length of the tuple fields. Individual per-field hash tables had the same drawback. We did not consider data structures larger than O(n) in the number of flows in a tuple, because OpenFlow tables can have hundreds of thousands of flows.

The solution we implemented statically divides fields into four groups, in decreasing order of traffic granularity: metadata (e.g., the switch ingress port), L2, L3, and L4. We changed each tuple from a single hash table to an array of four hash tables, called stages: one over metadata fields only, one over metadata and L2 fields, one over metadata, L2, and L3 fields, and one over all fields. (The latter is the same as the single hash table in the previous implementation.) A lookup in a tuple searches each of its stages in order. If any search turns up no match, then the overall search of the tuple also fails, and only the fields included in the stage last searched must be added to the megaflow match.

This optimization fixes a performance problem observed in production deployments. The NVP controller uses Open vSwitch to implement multiple isolated logical datapaths (further interconnected to form logical networks). Each logical datapath is independently configured. Suppose that some logical datapaths are configured with ACLs that allow or deny traffic based on L4 (e.g., TCP or UDP) port numbers. Megaflows for traffic on these logical datapaths must match on the L4 port to enforce the ACLs. Megaflows for traffic on other logical datapaths need not and, for performance, should not match on L4 port. Before this optimization, however, all generated megaflows matched on L4 port, because a classifier search had to pass through a tuple that matched on L4 port. The optimization allows megaflows for traffic on logical datapaths without L4 ACLs to avoid matching on L4 port, because the first three (or fewer) stages are enough to determine that there is no match.

This optimization technique would apply to any subsets of the supported fields, not just the layer-based subsets we used. We divided fields by protocol layer because, as a rule of thumb, in TCP/IP inner-layer headers tend to be more diverse than outer-layer headers. At L4, for example, the TCP source and destination ports change on a per-connection basis, but in the metadata layer only a relatively small and static number of ingress ports exist.

Each stage in a tuple includes all of the fields in earlier stages. We chose this arrangement, although the technique does not require it, because then hashes could be computed incrementally from one stage to the next, and profiling had shown hash computation to be a significant cost (with or without staging).

With four stages, one might expect the time to search a tuple to quadruple. Our measurements show that, in fact, classification speed actually improves slightly in practice, because when a search terminates at an early stage, the classifier does not have to compute the full hash of all the fields covered by the tuple.
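The staged lookup described above can be sketched as follows. This is an illustrative Python sketch, not OVS code: the field names and the concrete four-stage split are simplified stand-ins for the metadata/L2/L3/L4 grouping, and the stage tables here record only key presence rather than incremental hashes.

```python
# Sketch of a staged tuple. The tuple's fields are split into four
# cumulative stages; a lookup probes them in order, and on a miss
# only the fields of the last stage searched need to appear in the
# generated megaflow match.

STAGES = [
    ("metadata",),
    ("metadata", "eth_src", "eth_dst"),
    ("metadata", "eth_src", "eth_dst", "ip_dst"),
    ("metadata", "eth_src", "eth_dst", "ip_dst", "tcp_dst"),
]

class StagedTuple:
    def __init__(self):
        # One membership table per early stage; the final, full-key
        # stage maps to the cached actions.
        self.stage_tables = [set() for _ in STAGES[:-1]]
        self.full_table = {}

    def add_flow(self, match, action):
        """match must supply a value for every field of the tuple."""
        for table, fields in zip(self.stage_tables, STAGES[:-1]):
            table.add(tuple(match[f] for f in fields))
        self.full_table[tuple(match[f] for f in STAGES[-1])] = action

    def lookup(self, packet):
        """Returns (action_or_None, fields_to_unwildcard)."""
        for table, fields in zip(self.stage_tables, STAGES[:-1]):
            if tuple(packet[f] for f in fields) not in table:
                # Miss: only these fields were consulted so far.
                return None, fields
        key = tuple(packet[f] for f in STAGES[-1])
        return self.full_table.get(key), STAGES[-1]
```

A packet that misses already at the metadata stage unwildcards only the metadata field, so its megaflow need not match on L4 ports even though the tuple includes them, which mirrors the logical-datapath ACL scenario described above.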
5.4 Prefix Tracking

Flows in OpenFlow often match IPv4 and IPv6 subnets to implement routing. When all the flows that match on such a field use the same subnet size, e.g., all match /16 subnets, this works out fine for constructing megaflows. If, on the other hand, different flows match different subnet sizes, like any standard IP routing table does, the constructed megaflows match the longest subnet prefix, e.g., any host route (/32) forces all the megaflows to match full addresses. Suppose, for example, Open vSwitch is constructing a megaflow for a packet addressed to 10.5.6.7. If flows match subnet 10/8 and host 10.1.2.3/32, one could safely install a megaflow for 10.5/16 (because 10.5/16 is completely inside 10/8 and does not include 10.1.2.3), but without additional optimization Open vSwitch installs 10.5.6.7/32.(1) (Our examples use only octet prefixes, e.g., /8, /16, /24, /32, for clarity, but the implementation and the pseudocode shown later work in terms of bit prefixes.)

(1) This is a slight simplification for improved clarity; the actual implementation reverts to prefix tracking if staged lookups have concluded to include an IP field in the match.

We implemented optimization of prefixes for IPv4 and IPv6 fields using a trie structure. If a flow table matches over an IP address, the classifier executes an LPM lookup for any such field before the tuple space search, both to determine the maximum megaflow prefix length required and to determine which tuples can be skipped entirely without affecting correctness. As an example, suppose an OpenFlow table contained flows that matched on some IPv4 field, as shown:

    20/8
    10.1/16
    10.2/16
    10.1.3/24
    10.1.4.5/32

These flows correspond to the following trie, in which a solid circle represents one of the address matches listed above and a dashed circle indicates a node that is present only for its children. (The trie diagram is not reproduced here.)

To determine the bits to match, Open vSwitch traverses the trie from the root down through nodes with labels matching the corresponding bits in the packet's IP address. If traversal reaches a leaf node, then the megaflow need not match the remainder of the address bits, e.g., in our example 10.1.3.5 would be installed as 10.1.3/24 and 20.0.5.1 as 20/8. If, on the other hand, traversal stops due to the bits in the address not matching any of the corresponding labels in the tree, the megaflow must be constructed to match up to and including the bits that could not be found, e.g., 10.3.5.1 must be installed as 10.3/16 and 30.10.5.2 as 30/8.

The trie search result also allows Open vSwitch to skip searching some tuples. Consider the address 10.1.6.1. A search of the above trie for this address terminates at the node labeled 1, failing to find a node to follow for the address's third octet. This means that no flow in the flow table with an IP address match longer than 16 bits matches the packet, so the classifier lookup can skip searching tuples for the flows listed above with /24 and /32 prefixes.

Figure 3 gives detailed pseudocode for the prefix matching algorithm. Each node is assumed to have members bits, the bits in the particular node (at least one bit, except that the root node may be empty); left and right, the node's children (or NULL); and n_rules, the number of rules in the node (zero if the node is present only for its children, otherwise nonzero). It returns the number of bits that must be matched, allowing megaflows to be improved, and a bit-array in which 0-bits designate matching lengths for tuples that Open vSwitch may skip searching, as described above.

    function TRIESEARCH(value, root)
        node <- root, prev <- NULL
        plens <- bit-array of len(value) 0-bits
        i <- 0
        while node != NULL do
            c <- 0
            while c < len(node.bits) do
                if value[i] != node.bits[c] then
                    return (i + 1, plens)
                c <- c + 1, i <- i + 1
            if node.n_rules > 0 then
                plens[i - 1] <- 1
            if i >= len(value) then
                return (i, plens)
            prev <- node
            if value[i] = 0 then
                node <- node.left
            else
                node <- node.right
        if prev != NULL and prev has at least one child then
            i <- i + 1
        return (i, plens)

Figure 3: Prefix tracking pseudocode. The function searches for value (e.g., an IP address) in the trie rooted at node root. It returns the number of bits at the beginning of value that must be examined to render its matching node unique, and a bit-array of possible matching lengths. In the pseudocode, x[i] is bit i in x and len(x) the number of bits in x.

While this algorithm optimizes longest-prefix match lookups, it improves megaflows even when no flow explicitly matches against an IP prefix. To implement a longest prefix match in OpenFlow, the flows with longer prefixes must have higher priorities, which will allow the tuple priority sorting optimization in Section 5.2 to skip prefix matching tables after the longest match is found, but this alone causes megaflows to unwildcard address bits according to the longest prefix in the table. The main practical benefit of this algorithm, then, is to prevent policies (such as a high-priority ACL) that are applied to a specific host from forcing all megaflows to match on a full IP address. This algorithm allows the megaflow entries to match only the high-order bits sufficient to differentiate the traffic from the host with the ACL.

We also eventually adopted prefix tracking for L4 transport port numbers. Similar to IP ACLs, this prevents high-priority ACLs that match specific transport ports (e.g., to block SMTP) from forcing all megaflows to match the entire transport port fields, which would again reduce the megaflow cache to a microflow cache [32].
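The Figure 3 trie search can also be rendered as runnable code. The sketch below is illustrative, not OVS code, and it simplifies the algorithm in three labeled ways: one bit per trie node instead of OVS's compressed multi-bit nodes, a Python set of prefix lengths instead of a bit-array, and full 32-bit IPv4 addresses only.

```python
# Simplified, runnable rendering of the Figure 3 prefix-tracking
# search (one bit per node; set of prefix lengths instead of a
# bit-array; full 32-bit IPv4 values only).

class Node:
    def __init__(self):
        self.left = None    # child for a 0 bit
        self.right = None   # child for a 1 bit
        self.n_rules = 0    # rules ending exactly at this node

def ip_bits(dotted, plen=32):
    """'10.1.3.0' -> list of the first plen bits of the address."""
    val = 0
    for octet in dotted.split("."):
        val = (val << 8) | int(octet)
    return [(val >> (31 - i)) & 1 for i in range(plen)]

def trie_insert(root, bits):
    node = root
    for b in bits:
        child = node.left if b == 0 else node.right
        if child is None:
            child = Node()
            if b == 0:
                node.left = child
            else:
                node.right = child
        node = child
    node.n_rules += 1

def trie_search(root, value):
    """Returns (number of leading bits the megaflow must match,
    set of prefix lengths with rules along the matched path; tuples
    for other lengths can be skipped)."""
    plens = set()
    node, i = root, 0
    while i < len(value):
        nxt = node.left if value[i] == 0 else node.right
        if nxt is None:
            # Mismatch: match up to and including the differing bit,
            # unless we ran off a leaf (then the rest is wildcarded).
            has_child = node.left is not None or node.right is not None
            return (i + 1 if has_child else i), plens
        node, i = nxt, i + 1
        if node.n_rules > 0:
            plens.add(i)
    return i, plens
```

With the example table above (20/8, 10.1/16, 10.2/16, 10.1.3/24, 10.1.4.5/32), searching 10.1.3.5 reports 24 bits to match with rules at lengths {16, 24}, matching the 10.1.3/24 megaflow from the text; searching 20.0.5.1 reports 8 bits (20/8); and searching 10.1.6.1 reports rules only at length 16, so the tuples holding the /24 and /32 flows can be skipped.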
It returns the number of block SMTP) from forcing all megaflows to match the bits that must be matched, allowing megaflows to be im- entire transport port fields, which would again reduce the proved, and a bit-array in which o-bits designate matching megaflow cache to a microflow cache 32] lengths for tuples that Open vSwitch may skip searching, as described above 5.5 Classifier Partitioning While this algorithm optimizes longest-prefix match lookups, it improves megaflows even when no flow ex- The number of tuple space searches can be further reduced plicitly matches against an IP prefix. To implement a by skipping tuples that cannot possibly match. Open Flow 8 supports setting and matching metadata fields during a needed changes. Open v Switch had to examine every dat packets trip through the classifier. Open vSwitch parti- apath flow for possible changes. Each flow had to be tions the classifier based on a particular metadata field. If passed through the Open Flow Aow table in the same way the current value in that field does not match any value in as it was originally constructed, then the generated ac a particular tuple, the tuple is skipped altogether tions compared against the ones currently installed in the While Open vSwitch does not have a fixed pipeline like datapath. This can be time-consuming if there are many traditional switches, NVP often configures each lookup in datapath fows, but we have not observed this to be a the classifier as a stage in a pipeline. These stages match problem in practice, perhaps because there are only large on a fixed number of fields, simi lar to a tuple By storing numbers of datapath flows when the system actually has a a numeric indicator of the pipeline stage into a specialized high network load, making it reasonable to use more CPU metadata field, NVP provides a hint to the classifier to on networking. 
The real problem was that, because Open efficiently only look at pertinent tuples vSwitch was single-threaded, the time spent re-examining all of the datapath flows blocked setting up new flows 6 Cache invalidation for arriving packets that did not match any existing dat apath flow. This added high latency to flow setup for The flip side of caching is the complexity of managing the those packets, greatly increased the overall variability of cache In Open vSwitch, the cache may require updating How setup latency, and limited the overall flow setup rate for a number of reasons. Most obviously, the controller Through version 2.0, therefore, Open vSwitch limited the can change the Open Flow flow table. Open Flow alse maximum number of cached flows installed in the data- specifies changes that the switch should take on its own in ath to about 1, 000, increased to 2, 500 following some reaction to various events, e.g, OpenFlow""behav- optimizations, to minimize these problems ior can depend on whether carrier is detected on a network The second group consisted of changes whose effects interface. Reconfiguration that turns features on or off. on datapath flows could be narrowed down, such as MAC adds or removes ports, etc, can affect packet handling. learning table changes. Early versions of Open vSwitch Protocols for connectivity detection, such as CFM [10] implemented these in an optimized way using a technique or BFD [14, or for loop detection and avoidance, e. g, called tags. Each property that, if changed, could require (Rapid)Spanning Tree Protocol, can influence behavior. megaflow updates was given one of these tagS. Also Finally, some Open Flow actions and Open vSwitch exten- each megaflow was associated with the tags for all of sions change behavior based on network state, e. 
g, based the properties on which its actions depended, e.g., if the on mac learning actions output the packet to port x because the packet's Ideally, Open vSwitch could precisely identify the destination mac was learned to be on that port, then the megaflows that need to change in response to some event. megaflow is associated with the tag for that learned fact For some kinds of events. this is straightforward. For ex- Later, if that MAc learned port changed, Open vSwitch ample, when the Open vSwitch implementation of MAc added the tag to a set of tags that accumulated changes learning detects that a MAC address has moved from one In batches, Open vSwitch scanned the megafiow table for port to another, the datapath flows that used that MAc megaflows that had at least one of the changed tags, and are the ones that need an update. But the generality of checked whether their actions needed an update the Open Flow model makes precise identification difficult Over time, as controllers grew more sophisticated and in other cases. One example is adding a new flow to an flow tables more complicated, and as Open vSwitch added Open Flow table. Any megaflow that matched a flow in more actions whose behavior changed based on network that Open Flow table whose priority is less than the new state, each datapath flow became marked with more and flow's priority should potentially now exhibit different more tags. We had implemented tags as Bloom filters [21 behavior but we do not know how to efficiently (in time which meant that each additional tag caused more "false and space) identify precisely those flows. The problem is positives "for revalidation, so now most or all flows re worsened further by long sequences of Open Flow How ta- quired examination whenever any state changed. By ble lookups. We concluded that precision is not practical Open vSwitch version 2.0, the effectiveness of tags had in the general case. 
declined so much that to simplify the code Open vSwitch Therefore, early versions of Open vSwitch divided abandoned them altogether in favor of always revalidating changes that could require the behavior of datapath flows the entire datapath flow table to change into two groups. For the first group, the changes Since tags had been one of the ways we sought to mini- whose effects were too broad to precisely identify the mize flow setup latency, we now looked for other ways 2lleader space analysis [16] provides the algebra to identify the flows In Open vSwitch 2.0, toward that purpose, we divided but the feasibility of efficient, online analysis( such as in [15]) in this userspace into multiple threads. We broke flow setup into context remains an open question separate threads so that it did not have to wait behind revalidation Datapath flow eviction, however, remained part of the single main thread and could not keep up with multiple threads setting up flows. Under heavy flow setup 三0.6 load, though, the rate at which eviction can occur is criti cal, because userspace must be able to delete flows from ≥ the datapath as quickly as it can install new flows, or the datapath cache will quickly fill up Therefore, in Open vSwitch 2. 1 we introduced multiple dedicated threads for l0000 cache revalidation, which allowed us to scale up the reval Number of flows idation performance to match the flow setup performance Figure 4: Min/mean/max megaflow flow counts observed and to greatly increase the kernel cache maximum size to about 200,000 entries The actual maximum is dynami cally adjusted to ensure that total revalidation time stays pervisors running Open vSwitch to serve mixed tenant under 1 second, to bound the amount of time that a stale workloads in network virtualization setting entry can stay in the cache Open vSwitch userspace obtains datapath cache statis- tics by periodically(about once per second) polling the Cache sizes. 
The number of active megaflows gives kernel module for every flows packet and byte counters us an indication about practical megaflow cache sizes The core use of datapath flow statistics is to determine Open vSwitch handles. In Figure 4, we show the CDF which datapath flows are useful and should remain in for minimum, mean and maximum counts during the stalled in the kernel and which ones are not processing a observation period. The plots show that small megaflow significant number of packets and should be evicted. Short caches are sufficient in practice: 50% of the hypervisors of the table's maximum size, flows remain in the datapath had mean flow counts of 107 or less. The 99th percentile until they have been idle for a configurable amount of of the maximum flows was still just 7, 033 flows. For the time, which now defaults to 10 S(Above the maximum hypervisors in this environment, Open vSwitch userspace size, Open vSwitch drops this idle time to force the table an maintain a sufficiently large kernel cache (With the to shrink. The threads that periodically poll the kernel for latest Open vSwitch mainstream version, the kernel How per flow statistics also use those statistics to implement limit is set to 200.000 entries OpenFlow's per-flow packet and byte count statistics and How idle timeout features. This means that Open Flow Cache hit rates. Figure 5 shows the effectiveness of statistics are themselves only periodically updated caching. The solid line plots the overall cache hit rate The above describes how userspace invalidates the dat- across each of the 10-minute measurement intervals apath's megaflow cache. Maintenance of the first-level across the entire population of hypervisors. The over- microflow cache(discussed in Section 4)is much simpler. all cache hit rate was 97. 7%. 
The dotted line includes just A microflow cache entry is only a hint to the first hash ta- the 25 of the measurement periods in which the fewest ble to search in the general tuple space search. Therefore, packets were forwarded, in which the caching was less a stale microflow cache entry is detected and corrected effective than overall, achieving a 74.7%o hit rate. Intu- the first time a packet matches it. The microflow cache itively, caching is less effective(and unimportant)when has a fixed maximum size, with new microflows replac- there is little to cache. Open vSwitch caching is most ing old ones, so there is no need to periodically flush old effective when it is most useful: when there is a great entries. We use a pseudo-random replacement policy, for deal of traffic to cache. The dashed line, which includes simplicity, and have found it to be effective in practice. just the 25%o of the measurement periods in which the most packets were forwarded, demonstrates this: during these periods the hit rate rises slightly above the overall 7 Evaluation average to 98.0%0 The following sections examine open vSwitch perfor The vast majority of the hypervisors in this data center mance in production and in microbenchmarks do not experience high volume traffic from their work- loads. Figure 6 depicts this: 99% of the hypervisors see fewer than 79, 000 packets/s to hit their caches (and fewer 7.1 Performance in Production than 1500 flow setups/s to enter userspace due to misses) We examined 24 hours of open vswitch performance data from the hypervisors in a large, commercial multi-tenant CPU usage. Our statistics gathering process cannot sep- data center operated by rackspace. Our data set contains arate Open vSwitch kernel load from the rest of the kernel statistics polled every 10 minutes from over 1, 000 hy- load, so we focus on Open vSwitch userspace. As we
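The self-correcting behavior of the first-level microflow cache described above can be sketched in a few lines. This is a minimal illustration, not code from the Open vSwitch source: the class and method names are hypothetical, and the "tuples" here stand in for the hash tables of the tuple space search. The key property is that a stale entry is harmless, because a mismatch simply falls back to searching the remaining tuples, after which the hint is corrected in place.

```python
import random

# Illustrative sketch of a microflow cache: each entry maps a hash of a
# packet's headers to the index of the tuple (hash table) to try first.
# The entry is only a hint; a stale one costs an extra lookup, nothing more.

MICROFLOW_CACHE_SIZE = 4096

class MicroflowCache:
    def __init__(self, size=MICROFLOW_CACHE_SIZE):
        self.size = size
        self.entries = {}  # packet-hash -> index of tuple to try first

    def lookup(self, pkt_hash, tuples, pkt):
        """Return the matching megaflow, consulting the hint first."""
        order = list(range(len(tuples)))
        hint = self.entries.get(pkt_hash)
        if hint is not None and hint < len(tuples):
            # Try the hinted tuple first; if stale, the rest are searched.
            order.remove(hint)
            order.insert(0, hint)
        for idx in order:
            flow = tuples[idx].match(pkt)
            if flow is not None:
                self.insert(pkt_hash, idx)  # correct/refresh the hint
                return flow
        return None  # cache miss: packet goes to userspace

    def insert(self, pkt_hash, tuple_idx):
        if pkt_hash not in self.entries and len(self.entries) >= self.size:
            # Fixed maximum size with pseudo-random replacement, as in the text.
            victim = random.choice(list(self.entries))
            del self.entries[victim]
        self.entries[pkt_hash] = tuple_idx
```

Because the hint only reorders the search, correctness never depends on the cache contents, which is why no periodic flush or revalidation of this level is needed.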

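The tag scheme that Section 6 describes, and that Open vSwitch eventually abandoned, can also be illustrated with a toy sketch. All names below are hypothetical; for simplicity each property hashes to a single bit (a degenerate Bloom filter), whereas a full Bloom filter would use several hash functions. The sketch shows both the optimization (only flows whose tag bits intersect the changed bits are re-examined) and its failure mode: unrelated properties that collide on a bit produce the "false positives" that made tags degenerate toward full revalidation.

```python
# Toy sketch of tag-based megaflow revalidation (illustrative only).
# Each megaflow records the OR of the tag bits of the properties its
# actions depend on; revalidation scans only flows whose bits intersect
# the accumulated bits of changed properties.

TAG_BITS = 32  # small on purpose, so collisions are easy to demonstrate

def tag_for(prop):
    """Map a property (e.g., a learned MAC/port binding) to one tag bit."""
    return 1 << (hash(prop) % TAG_BITS)

class MegaflowTable:
    def __init__(self):
        self.flows = {}   # flow-id -> OR of tag bits it depends on
        self.changed = 0  # accumulated tag bits of changed properties

    def install(self, flow_id, depends_on):
        bits = 0
        for prop in depends_on:
            bits |= tag_for(prop)
        self.flows[flow_id] = bits

    def property_changed(self, prop):
        self.changed |= tag_for(prop)

    def revalidate(self):
        """Return flows needing re-examination; clear the change set."""
        hits = [f for f, bits in self.flows.items() if bits & self.changed]
        self.changed = 0
        return hits
```

As flows accumulate more tags and more properties share the fixed bit space, nearly every change intersects nearly every flow, which is why always revalidating the whole table became the simpler choice.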