Low Latency Performance Tuning Guide for Red Hat Enterprise Linux 6

Table of Contents

1 Executive Summary
2 Process Scheduling
  2.1 Avoiding Interference
  2.2 Scheduler tunables
  2.3 Perf
3 Memory
  3.1 NUMA Topology
  3.2 Hugepages
  3.3 Transparent Hugepages
4 Network
  4.1 IRQ processing
  4.2 Drivers
  4.3 Network tunables
  4.4 ethtool
5 Power
  5.1 Tuned
  5.2 BIOS Configuration
6 Kernel boot parameters
7 References
8 Revision History

refarch-feedback@redhat.com    www.redhat.com

1 Executive Summary

This paper provides a tactical tuning overview of Red Hat Enterprise Linux 6 for latency-sensitive workloads. In a sense, this document is a cheat-sheet for getting started on this complex topic and is intended to complement existing Red Hat documentation.

It is very important to gain a deep understanding of these tuning suggestions before you apply them in any production scenario, as your mileage will vary.

Note that certain features mentioned in this paper may require the latest minor version of Red Hat Enterprise Linux 6.

2 Process Scheduling

Linux provides the user with a system of priorities and scheduler policies that influence the task scheduler. A common trait of latency-sensitive tasks is that they should run continuously on the CPU with as few interruptions as possible. If your application is being scheduled off the CPU in favor of an unimportant task, try increasing its priority using nice. Another option is to change the scheduler policy to FIFO with a priority of 1. The example command below runs YOURPROC ahead of userspace tasks and some kernel threads. Note that there are some kernel threads that run with the FIFO policy.
This has been met with mixed results, and often using SCHED_OTHER via nice performs similarly.

    # chrt -f 1 ./YOURPROC
    # nice -n -20 ./YOURPROC

When using the SCHED_FIFO policy for your application, it is possible to introduce latency spikes or other anomalies by blocking kernel threads that use SCHED_OTHER. All SCHED_OTHER tasks are of lower priority than SCHED_FIFO. For this reason it is important that you test extensively when using the FIFO scheduler. As an example, a kernel thread relevant to networking is ksoftirqd. To determine if observed latency blips are priority related, try the commands below. In the report you might see ksoftirqd at the top of the list, meaning it was likely blocked.

    # perf sched record -o /dev/shm/perf.data ./YOURPROC
    # perf sched -i /dev/shm/perf.data latency > /tmp/perf.latency.report

    Task             | Runtime ms | Switches | Average delay ms
    ksoftirqd/2:23   |            | 26       | avg: 3388.621 ms

2.1 Avoiding Interference

To elaborate further on process priorities and scheduling, it is important to streamline the system such that only essential tasks are scheduled, for example, by ensuring that you have disabled all unnecessary services. There is limited flexibility with regard to kernel threads as compared to userspace threads. Here are some options for task affinity and isolation to reduce jitter and latency.

isolcpus
Isolate CPU cores from userspace tasks. This can be done with, for example, isolcpus=1,3,5,7,9,11. isolcpus requires a reboot, and there will continue to be kernel threads on isolated cores. isolcpus sets the affinity of the init task to the inverse bitmask of that specified by isolcpus. All userspace tasks will inherit from init, including any new tasks.
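As a rough illustration of the inverse bitmask described above, the sketch below computes the affinity mask left over for init once a set of cores is isolated. The helper name inverse_mask and the 12-CPU system are assumptions made for this example; they are not part of isolcpus or of any Red Hat tool. The helper only handles comma-separated single CPU numbers (no ranges) and systems with fewer than 63 CPUs.

```shell
#!/bin/bash
# Hypothetical helper (an assumption for illustration): given an
# isolcpus-style comma-separated CPU list and the total CPU count,
# print the hexadecimal mask of the remaining, non-isolated CPUs.
inverse_mask() {
    local isolated="$1" ncpus="$2"
    local all=$(( (1 << ncpus) - 1 ))   # one bit set for every CPU
    local iso=0 cpu
    local IFS=','                       # split the list on commas
    for cpu in $isolated; do
        iso=$(( iso | (1 << cpu) ))     # set the bit of each isolated CPU
    done
    printf '0x%x\n' $(( all & ~iso ))   # clear the isolated bits
}

# isolcpus=1,3,5,7,9,11 on an assumed 12-CPU system
inverse_mask "1,3,5,7,9,11" 12          # prints 0x555
```

The result 0x555 has bits 0,2,4,6,8,10 set, which are the cores userspace tasks would still inherit from init in this example.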
To land a task on one of the isolated cores, you need to explicitly ask for it via taskset/numactl or sched_setaffinity.

tuna
See the Tuna User Guide for a flexible isolation tool (part of the Red Hat MRG Realtime product).

Isolate sockets from userspace processes, eventually pushing them to socket 0:

    # tuna -S1 -i ; tuna -S2 -i ; tuna -S3 -i

Push YOURPROC to core 1 using tuna and set sched/prio to fifo:1:

    # tuna -t `pgrep YOURPROC` -C -p fifo:1 -S 1 -m -P | head -5

cgroups
Cgroups can provide a user-friendly way of grouping NUMA resources. An example /etc/cgconfig.conf:

    group node0 {
        cpuset {
            cpuset.cpus = "0,2,4,6,8,10";
            cpuset.cpu_exclusive = 1;
            cpuset.mems = 0;
        }
    }
    group node1 {
        cpuset {
            cpuset.cpus = "1,3,5,7,9,11";
            cpuset.cpu_exclusive = 1;
            cpuset.mems = 1;
        }
    }

To launch a task within the node1 cgroup, use:

    # cgexec -g cpuset:node1 ./YOURPROC

See the Resource Management Guide for more information, including how to use /etc/cgrules.conf to write rules that automatically place new tasks in the desired cgroup.

2.2 Scheduler tunables

The following scheduler tunables have been found to impact latency-sensitive workloads and can be adjusted by the user.

sched_min_granularity_ns
Set it to a lower value for latency-sensitive tasks (or huge thread-count). Set it to a higher value for compute-bound/throughput oriented workloads. Adjust by a factor of 2-10X.

sched_migration_cost
Specifies the amount of time after the last execution during which a task is considered to be "cache hot" in migration decisions. Increasing the value of this variable reduces task migrations. Adjust by a factor of 2-10X. Task migrations may be irrelevant depending on the specific task affinity settings you've configured.

2.3 Perf

Perf is a utility to read performance counters in both hardware (CPU) and kernel tracepoints. When optimizing for better latency, you may be concerned with CPU counters such as cache misses, cpu-migrations, page-faults.
Software events such as scheduler, network, the VM, or timers can also be tracked and reported using perf. Documentation is available in the Developer Guide, man pages, or in the kernel source. Here are some brief examples to get you started.

Find out what capabilities the CPU/kernel can provide. CPUs may differ.

    # perf list

Find out what the CPU is currently spending cycles on, i.e. hot code paths/functions:

    # perf top

Profile YOURPROC with regard to kmem, block, ext4 and sched kernel tracepoints:

    # perf stat -e kmem:* -e block:* -e ext4:* -e sched:* ./YOURPROC

To view the hardware cache-misses counter triggered by the command ls:

    # perf stat -e cache-misses ls

3 Memory

Recent server platforms are physically wired up in a NUMA configuration. When you pin a process to a specific CPU or core there are going to be certain combinations that will perform faster than others, because memory banks and/or PCIe slots are local to certain CPU sockets.

3.1 NUMA Topology

Use hwloc to visualize hardware topology. hwloc uses the BIOS SLIT table, and this table may or may not be fully populated depending on your server's BIOS. You may need to determine this experimentally. See hardware vendor documentation.

numactl: Bind apps to sockets and local memory banks, e.g. CPU socket 0/memory bank 0:

    # numactl -N0 -m0 ./YOURPROC

taskset: Bind apps to cores. Kernel best-effort for memory locality.

    # taskset -c3 ./YOURPROC

View current CPU/memory pinning done with either numactl or taskset:

    # grep -i allowed /proc/PID/status

Adjust the VM's tendency to swap by setting vm.swappiness=0.
Consider disabling the ksm daemon.
Use the numastat or numa_faults.stp systemtap script to troubleshoot NUMA misses.
Use Intel Performance Counter Monitor to track QPI link usage.
Both network and storage performance can benefit from PCI slot locality and IRQ affinity.

3.2 Hugepages

The default 4KB page size in Linux can introduce a performance hit in systems with a large memory footprint.
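To put the page-count problem in numbers, the sketch below counts how many pages are needed to map a given amount of memory at two page sizes. The helper name pages_needed and the 64 GiB footprint are assumptions chosen only for illustration.

```shell
#!/bin/bash
# Hypothetical helper (an assumption for illustration): number of pages
# needed to map mem_gib GiB of memory using pages of page_kib KiB each.
pages_needed() {
    local mem_gib="$1" page_kib="$2"
    echo $(( mem_gib * 1024 * 1024 / page_kib ))   # GiB -> KiB, then divide
}

pages_needed 64 4      # default 4 KiB pages: prints 16777216
pages_needed 64 2048   # 2 MiB hugepages:     prints 32768
```

Fewer pages means fewer translations for the CPU and kernel to track for the same memory footprint.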
2 MB hugepages reduce the number of pages by a factor of 512. Here is one way to configure hugepages.

Getting information about hugepages:

    # egrep -i 'thp|trans|huge' /proc/meminfo /proc/vmstat

3.3 Transparent Hugepages

Transparent hugepages (THP) is an abstraction layer that automates most aspects of creating, managing, and using hugepages. THP instantiates a background kernel thread called khugepaged that scans memory for merging candidates. THP is enabled by default. Depending on your tolerance, these scans may introduce jitter. To disable THP, either add transparent_hugepage=never to the kernel cmdline (requires reboot), or run:

    # echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

4 Network

When tuning network performance for low latency, the goal is to have IRQs land on the same core or socket that is currently executing the application that is interested in the network packet. This increases CPU cache hit-rate (see section on perf) and avoids using the inter-processor link.

4.1 IRQ processing

Red Hat has published an in-depth whitepaper on IRQ affinity tuning. An automated way to achieve affinity (requires Red Hat Enterprise Linux 6.2 or later) is to enable Receive Flow Steering (RFS). RFS (and its transmit counterpart XPS) are highly dependent on hardware and driver capabilities. Certain drivers steer in hardware.

4.2 Drivers

Certain NIC vendors provide low-latency tuning guidelines, which you should read and test. They may also provide scripts that line up IRQs with cores, i.e. a 1:1 relationship. Others have the ability to vary the number of RX/TX queues. Consider using those scripts, or try irqbalance --oneshot.

4.3 Network tunables

tcp_low_latency: Demonstrated minimal impact (1 us range).
Use the TCP_NODELAY socket option where applicable.

4.4 ethtool

Many statistics (ethtool -S) are hardware/driver dependent. Some NICs track stats in firmware, others in-kernel.
When tracking is done in firmware, it is difficult to understand the true meaning of stats (i.e. what causes an error counter to increment). Consult your hardware vendor documentation.

Coalescing (ethtool -cC): Quantity of packets the NIC will accept before triggering an interrupt. Experiment with a coalesce value of 0 or 1. Verify you have enough remaining CPU cycles to handle bursts.

Ring buffers (ethtool -gG): Set of buffers that provide a very small amount of memory to deal with cases when the CPU is not available to process packets. Driver-dependent, output may be in slots or bytes. There has been a move to Byte Queue Limits, which avoids a head-of-line blocking problem caused by the slots technique. Increase the value to deal with drops, but watch for added latency.

Offload Capabilities (ethtool -kK): Network adapters may provide some amount of offload capabilities, such as TCP Segmentation Offload (TSO), Large Receive Offload (LRO), etc. Offloads move some of the network processing out of the kernel and onto the NIC. These generally save CPU cycles but can be a mixed bag in terms of latency performance, as many are designed to increase throughput and decrease CPU utilization through batching techniques. For example, Generic Receive Offload (GRO) may aggregate packets up to 64KB. Note that since offloads occur between the OS and the wire, their properties are generally not observable with tcpdump on sender/receiver; use a port-mirror or equivalent tap device to see what the wire traffic looks like. Offloads will modify your packet quantity/frame size and flow characteristics. These may vary considerably from the MTU configured in the OS or on the network switch.

Consider:
    # ethtool -K ethX lro off gro off tso off gso off

Make the ethtool configuration persistent using udev rules, your application's startup scripts, or the ETHTOOL_OPTS variable in /etc/sysconfig/network-scripts.

5 Power

Newer CPUs may alter their performance based on a workload heuristic in order to save power. This is at odds with latency-sensitive workload requirements, causing sub-optimal performance/jitter. Here is an example script to lock the system into certain C-states, that uses the kernel's /dev/cpu_dma_latency interface documented here.

Use the example script along with the values returned by the below command to lock CPU cores into particular C-states. C0 provides the lowest latency, at the cost of higher power, temperature, and cooling costs. Depending on latency requirements, C1 may also be usable. C3 and deeper show measurable performance impact, but should be fine for off-hours workloads.

    # find /sys/devices/system/cpu/cpu0/cpuidle -name latency -o -name name | xargs cat

Use powertop in Red Hat Enterprise Linux or turbostat (in upstream kernel) to view current C-states.

5.1 Tuned

Consider using the latency-performance tuned profile. Currently, tuned profiles also set the I/O scheduler to deadline. If you use one of these profiles (or clone one and add your site-specific tuning), the kernel command-line option elevator=deadline may be redundant. Most tuned profiles also change the cpuspeed governor to performance, ensuring the highest clock frequency. Alternatively, disable the cpuspeed service, or modify /etc/sysconfig/cpuspeed and set GOVERNOR=performance. To find the current frequency, execute:

    # find /sys -name cpuinfo_cur_freq | xargs cat

5.2 BIOS Configuration

Many server vendors have published BIOS configuration settings geared for low latency, which must be carefully followed. Ensure you are running the latest BIOS version. The rdmsr utility from the msr-tools package is used to read Model Specific Registers off of a CPU.
For example, on certain newer generation Intel CPUs:

    # rdmsr -d 0x34
    97

This means 97 SMIs have fired since boot. After implementing low-latency tuning guidelines from your server vendor, this counter should rarely (if ever) increment after boot. It would be useful to query this value before and after a test run to verify if any SMIs have fired.

6 Kernel boot parameters

See the kernel documentation for Red Hat Enterprise Linux. An example low-latency command line:

    nosoftlockup mce=ignore_ce

nosoftlockup disables logging of backtraces when a process executes on a CPU for longer than the softlockup threshold (default 120 seconds). Typical low-latency programming and tuning techniques might involve spinning on a core or modifying scheduler priorities/policies, which can lead to a task reaching this threshold. If a task has not relinquished the CPU for 120 seconds, the kernel will print a backtrace for diagnostic purposes. Note that adding nosoftlockup to the cmdline will simply disable the printing of this backtrace and does not in itself reduce latency.

mce=ignore_ce ignores corrected errors and associated scans that can cause periodic latency spikes.

Consider audit=0 to disable the kernel components of the audit subsystem, which have been measured at about 1-3% CPU utilization when under heavy load. Also be sure to run chkconfig auditd off.

Missing from this list are idle, processor.max_cstate and intel_idle.max_cstate. These options require a reboot to adjust, and thus we recommend the /dev/cpu_dma_latency interface, as it achieves an equivalent performance improvement without the reboot requirement. You can save considerably in power and cooling costs by locking C-states only as necessary (e.g. during trading hours).

Consider disabling SELinux using /etc/selinux/config (measured 1-3% CPU overhead).

Use tuned profiles (e.g. enterprise-storage) to control the I/O elevator, rather than the cmdline.

7 References

Performance Tuning Guide
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html-single/Performance_Tuning_Guide/index.html

Realtime Tuning Guide
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_MRG/2/html-single/Realtime_Tuning_Guide/index.html

Resource Management Guide
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html-single/Resource_Management_Guide/index.html

Developer Guide
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html-single/Developer_Guide/index.html#perf
