Tuesday, January 27, 2009

The sar command

The sar command gathers statistical data about the system.
Although it can gather useful data about system performance, the sar command itself adds to the system load, and a high sampling frequency can exacerbate a pre-existing performance problem. Compared to the accounting package, however, the sar command is less intrusive. The system maintains a series of system activity counters that record various activities and provide the data that the sar command reports. The sar command does not cause these counters to be updated or used; that happens automatically whether or not the sar command runs. It merely extracts the data from the counters and saves it, based on the sampling rate and number of samples specified on the command line.
With its numerous options, the sar command provides queuing, paging, TTY, and many other statistics. One important feature of the sar command is that it reports either system-wide (global among all processors) CPU statistics (which are calculated as averages for values expressed as percentages, and as sums otherwise), or it reports statistics for each individual processor. Therefore, this command is particularly useful on SMP systems.
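The averaging rule just described (percentages averaged across processors, everything else summed) can be sketched in a few lines of Python. This is only an illustration under assumptions, not sar's own implementation: the function name, the per-processor dictionaries, and the sample readings are all made up for the example.

```python
def aggregate_cpu_stats(per_cpu, percent_fields):
    """Combine per-processor samples into one system-wide record.

    Fields named in percent_fields are averaged across processors;
    every other field is summed.
    """
    if not per_cpu:
        raise ValueError("need at least one processor sample")
    totals = {}
    for sample in per_cpu:
        for name, value in sample.items():
            totals[name] = totals.get(name, 0) + value
    count = len(per_cpu)
    return {
        name: (total / count if name in percent_fields else total)
        for name, total in totals.items()
    }


# Hypothetical readings for a two-processor system.
cpus = [
    {"%usr": 10, "%sys": 4, "scall/s": 100},
    {"%usr": 30, "%sys": 6, "scall/s": 300},
]
system_wide = aggregate_cpu_stats(cpus, percent_fields={"%usr", "%sys"})
print(system_wide)
```

The percentage columns come out as the mean of the two processors, while the rate column is the total for the whole system.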
There are three situations in which to use the sar command:

Real-time sampling and display

# sar -u 2 5

AIX ses12 1 6 000126C5D600    04/08/08

System configuration: lcpu=2  mode=Capped

19:42:43    %usr    %sys    %wio   %idle   physc
19:42:45       0       2       1      97    0.98
19:42:47       0       0       0     100    1.02
19:42:49       0       0       0     100    1.00
19:42:51       0       0       0     100    1.00
19:42:53       0       0       0     100    1.00

Average        0       1       0      99    1.00

# sar -o /tmp/sar.out 2 5 > /dev/null

# sar -f/tmp/sar.out
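
If you want to post-process sar -u output rather than read it by eye, a small parser is enough. The sketch below assumes the text layout shown in the sample earlier in this section (a timestamped header line followed by timestamped sample lines); the function name and the embedded sample are assumptions for illustration, and other sar versions may lay out their output differently.

```python
import re

TIME_RE = re.compile(r"^\d{2}:\d{2}:\d{2}$")


def parse_sar_cpu(text):
    """Return one dict per sample line of `sar -u` style output."""
    columns = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or not TIME_RE.match(fields[0]):
            continue  # skips the banner, blank lines, and the Average row
        if columns is None:
            columns = fields[1:]  # the first timestamped line is the header
            continue
        row = {"time": fields[0]}
        row.update(zip(columns, (float(field) for field in fields[1:])))
        rows.append(row)
    return rows


SAMPLE = """\
AIX ses12 1 6 000126C5D600 04/08/08
System configuration: lcpu=2 mode=Capped
19:42:43 %usr %sys %wio %idle physc
19:42:45 0 2 1 97 0.98
19:42:47 0 0 0 100 1.02
19:42:49 0 0 0 100 1.00
19:42:51 0 0 0 100 1.00
19:42:53 0 0 0 100 1.00
Average 0 1 0 99 1.00
"""

rows = parse_sar_cpu(SAMPLE)
busiest = min(rows, key=lambda row: row["%idle"])
print(busiest["time"], busiest["%idle"])
```

Here the busiest sample is simply the one with the lowest %idle value.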

CPU and Performance Monitoring

mpstat
The mpstat utility is a useful tool to monitor CPU utilization, especially with multithreaded applications running on multiprocessor machines, which is a typical configuration for enterprise solutions.
Use mpstat with an interval argument of 5 to 10 seconds. An interval smaller than that might be more difficult to analyze. A larger interval smooths the data by removing spikes that could otherwise mislead you.

# mpstat 10
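
To see why a longer interval smooths the data, here is a rough Python sketch that averages short-interval readings into longer windows. The readings are invented for the example, and mpstat itself derives its figures from kernel counters rather than from averaged samples, so treat this only as an illustration of the effect.

```python
def average_over_windows(samples, window):
    """Average consecutive groups of `window` samples (leftovers are dropped)."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(samples[start:start + window]) / window
        for start in range(0, len(samples) - window + 1, window)
    ]


# Hypothetical usr percentages sampled once per second, with one short spike.
usr_per_second = [5, 5, 95, 5, 5, 5, 5, 5, 5, 5]
print(max(usr_per_second))                       # the raw one-second peak
print(average_over_windows(usr_per_second, 10))  # the same data as one 10-second figure
```

The one-second spike dominates the raw data but nearly disappears in the 10-second average, which is the smoothing described above.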

What to Look For
Note the much higher intr and ithr values for certain CPUs. Solaris selects some CPUs to handle the system interrupts. Which CPUs are chosen, and how many, depends on the I/O devices attached to the system, the physical location of the devices, and whether interrupts have been disabled on a CPU (psradm command).

· intr - Interrupts.
· ithr - Interrupts handled as threads (not including the clock interrupt).
· csw - Context switches (voluntary and involuntary). When this number slowly increases, and the application is not I/O bound, it may indicate mutex contention.
· icsw - Involuntary context switches. When this number increases past 500, the system is under a heavy load.
· smtx - Spins on mutexes (lock not acquired on the first try). A sharp increase, for example from 50 to 500, is a sign of a system resource bottleneck (for example, network or disk).
· usr, sys, and idl - Together, all three columns represent CPU saturation. A well-tuned application under full load (0% idle) should be within 80% to 90% usr, and 20% to 10% sys times, respectively. A smaller percentage value for sys reflects more time for user code and less preemption, which results in greater throughput for the Portal application.
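
The rules of thumb in this list can be turned into a quick check. The Python sketch below is an assumption-laden illustration: the function name and the dictionary layout are made up, and "sharp increase" is interpreted here as a tenfold jump in smtx (matching the 50 to 500 example), which is my reading rather than a documented threshold.

```python
def check_mpstat_row(row, previous_smtx=None, sharp_factor=10):
    """Return warnings for one CPU's mpstat values (dict keyed by column name)."""
    warnings = []
    if row["icsw"] > 500:
        warnings.append("icsw above 500: heavy load")
    # Assumption: "sharp" means at least sharp_factor times the previous value.
    if previous_smtx and row["smtx"] >= previous_smtx * sharp_factor:
        warnings.append("smtx rose sharply: possible resource bottleneck")
    if row["idl"] == 0:  # the usr/sys guideline applies under full load only
        if not 80 <= row["usr"] <= 90:
            warnings.append("usr outside 80-90% at full load")
        if not 10 <= row["sys"] <= 20:
            warnings.append("sys outside 10-20% at full load")
    return warnings


# Hypothetical readings for two CPUs.
loaded = {"icsw": 620, "smtx": 500, "usr": 70, "sys": 30, "idl": 0}
healthy = {"icsw": 120, "smtx": 60, "usr": 85, "sys": 15, "idl": 0}
print(check_mpstat_row(loaded, previous_smtx=50))
print(check_mpstat_row(healthy, previous_smtx=50))
```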

Considerations

Make your application available to as many CPUs as it can efficiently use. For example, if you get the best performance from one instance on 2 CPUs, you can expect that creating 14 two-CPU processor sets would yield the best performance.

An increasing csw value often accompanies an increase in network use. A common cause of a high csw value is having created too many socket connections, either by not pooling connections or by handling new connections inefficiently. If this is the case, you would also see a high TCP connection count when executing netstat -a | wc -l. For more information, refer to the netstat section.

Do you observe increasing icsw? A common cause of this is preemption, most likely because of an end of time slice on the CPU.

iostat

The iostat tool gives statistics on the disk I/O subsystem. The iostat command has many options. More information can be found in the man pages. The following typical options provide information on locating I/O bottlenecks.

#iostat -xn 10

What to Look For

· %b - Percentage of time the disk is busy (transactions in progress). Average %b values over 25 could indicate a bottleneck.
· %w - Percentage of time transactions are waiting for service (queue non-empty).
· asvc_t - Average response time of active transactions, in milliseconds. The name is misleading: the value indicates the time between a user process issuing a read and the read completing. Consistent values over 30 ms could indicate a bottleneck.
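
As a rough illustration of the two thresholds in this list, the Python sketch below flags a disk when its average %b is over 25 or when every asvc_t reading is over 30 ms. The function name, the data layout, and the device names are assumptions for the example; the readings would have to be collected from iostat output separately.

```python
def find_disk_bottlenecks(samples, busy_limit=25.0, service_limit_ms=30.0):
    """samples maps a device name to a list of (%b, asvc_t) readings."""
    findings = {}
    for device, readings in samples.items():
        if not readings:
            continue
        reasons = []
        average_busy = sum(busy for busy, _ in readings) / len(readings)
        if average_busy > busy_limit:
            reasons.append(f"average %b over {busy_limit:g}")
        # "Consistent" is read here as: every reading exceeds the limit.
        if all(service > service_limit_ms for _, service in readings):
            reasons.append(f"asvc_t consistently over {service_limit_ms:g} ms")
        if reasons:
            findings[device] = reasons
    return findings


# Hypothetical readings taken from three iostat intervals.
disk_samples = {
    "c0t0d0": [(40, 35.2), (30, 41.0), (20, 33.5)],
    "c0t1d0": [(5, 4.1), (10, 45.0), (3, 6.0)],
}
print(find_disk_bottlenecks(disk_samples))
```

A single slow reading, as on the second disk, is not flagged because the guideline is about consistent values.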

Considerations
Add more disks to the file system. If you are using a single-disk file system, upgrading to hardware or software RAID is the next logical step. Hardware RAID is significantly faster than software RAID and is highly recommended. A software RAID solution adds CPU load to the system.
Depending on storage hardware and application behavior, there may be a better block size to use than the UFS default of 8192 bytes (8 KB). For more information, consult the Solaris System Administration Guide.

netstat

The netstat tool gives statistics on the network subsystem. It can be used to analyze many aspects of the network subsystem, two of which are the TCP/IP kernel module and the interface bandwidth. An overview of both uses follows.

netstat -I hme0 10
These netstat options are used to analyze interface bandwidth. An upper bound (maximum) on the current throughput can be calculated from the output. Only an upper bound is available because netstat reports packet counts, and packets are not necessarily at their maximum size. The upper bound of the bandwidth, in bytes per second, can be calculated using the following equation:
Bandwidth Used = ((Total number of packets) / (Polling interval (10))) * MTU (1500 default)
The current MTU for an interface can be found with: ifconfig -a
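
The bandwidth equation is easy to apply in code. The Python sketch below is a minimal example: the function name is made up, the packet count is a hypothetical figure for one 10-second interval, and the result is converted to megabits per second so it can be compared with the link speed.

```python
def bandwidth_upper_bound(total_packets, interval_seconds=10, mtu_bytes=1500):
    """Upper bound on throughput in bytes per second.

    Assumes every packet is as large as the MTU, which is why the
    result is an upper bound rather than the actual throughput.
    """
    return total_packets / interval_seconds * mtu_bytes


# Hypothetical packet count reported for one 10-second interval.
bytes_per_second = bandwidth_upper_bound(20000)
megabits_per_second = bytes_per_second * 8 / 1_000_000
print(bytes_per_second, megabits_per_second)
```

For these figures the bound works out to 24 Mbit/s, which would be about a quarter of a 100 Mbit interface.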

netstat -I hme0 10 Output

#netstat -I hme0 10

What to Look For

· colls - Collisions. If your network is not switched, a low level of collisions is expected. As the network becomes increasingly saturated, collisions will increase and eventually become a bottleneck. The best solution for collisions is a switched network.
· errs - Errors. The presence of errors could indicate device errors. If your network is switched, errors can indicate that you are close to consuming the bandwidth capacity of your network. The solution to this problem is to give the system more bandwidth, through more network interfaces or a network bandwidth upgrade. This is highly dependent on your particular network architecture.

Considerations

· If network saturation is occurring quickly (for example, saturation at fewer than 8 CPUs for an application server running on 100 Mbit Ethernet), then an investigation to ensure conservative network usage is a good first step.

· Increase network bandwidth. Possible steps include upgrading to a switched network, adding network interfaces, or upgrading to a higher-bandwidth network to accommodate your network traffic demand.

The netstat -sP tcp options are used to analyze the TCP kernel module. Many of the fields reported correspond to fields in the kernel module that can indicate bottlenecks. These bottlenecks can be addressed using the ndd command and the tuning parameters.

netstat -sP tcp Output

#netstat -sP tcp

TCP     tcpRtoAlgorithm      =       4     tcpRtoMin            =     400
        tcpInDupSegs         =    1144     tcpInDupBytes        =  132520
        tcpInPartDupSegs     =       1     tcpInPartDupBytes    =     416
        tcpInPastWinSegs     =       0     tcpInPastWinBytes    =       0
        tcpInWinProbe        =      46     tcpInWinUpdate       =      48
        tcpInClosed          =     251     tcpRttNoUpdate       =     344
        tcpRttUpdate         = 1105386     tcpTimRetrans        =     989
        tcpTimRetransDrop    =       5     tcpTimKeepalive      =     818
        tcpTimKeepaliveProbe =     183     tcpTimKeepaliveDrop  =       0
        tcpListenDrop        =       0     tcpListenDropQ0      =       0
        tcpHalfOpenDrop      =       0     tcpOutSackRetrans    =      56

What to look for

· tcpListenDrop - If tcpListenDrop continues to increase over several runs of the command, it could indicate a problem with queue size.
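
Watching a counter across several runs is tedious by hand, so here is a rough Python sketch of the idea. It assumes the "name = value" layout shown in the netstat -sP tcp output above; the function names and the three captured snippets are hypothetical, and collecting the output on a schedule is left out.

```python
import re

COUNTER_RE = re.compile(r"(\w+)\s*=\s*(\d+)")


def parse_tcp_counters(text):
    """Extract every `name = value` pair from netstat -sP tcp style text."""
    return {name: int(value) for name, value in COUNTER_RE.findall(text)}


def keeps_increasing(values):
    """True when each reading is larger than the one before it."""
    return len(values) >= 2 and all(
        later > earlier for earlier, later in zip(values, values[1:])
    )


# Hypothetical fragments captured from three successive runs.
runs = [
    "tcpListenDrop= 0 tcpListenDropQ0 = 0",
    "tcpListenDrop= 12 tcpListenDropQ0 = 0",
    "tcpListenDrop= 40 tcpListenDropQ0 = 0",
]
drops = [parse_tcp_counters(run)["tcpListenDrop"] for run in runs]
print(drops, keeps_increasing(drops))
```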

Considerations:

· A possible cause of increasing tcpListenDrop is application throughput being bottlenecked by the number of executing threads. In that case, increasing the number of application threads may be worth trying.

· Increase queue size. Increase the request queue sizes using ndd. More information on other ndd commands can be found in the Solaris Administration Guide.

ndd -set /dev/tcp tcp_conn_req_max_q value

ndd -set /dev/tcp tcp_conn_req_max_q0 value

netstat -a | grep your_hostname | wc -l

Running this command gives a rough count of socket connections on the system. The number of connections open at one time is limited; you can use this tool to look for bottlenecks.

netstat -a | grep your_hostname | wc -l Output

# netstat -a | wc -l

34567

What to Look For

· socket count - If the number returned is greater than 20,000, then the number of socket connections could be a bottleneck.

Consider the following:

· Decrease the port number at which anonymous socket connections start.

ndd -set /dev/tcp tcp_smallest_anon_port value

· Decrease the time a TCP connection stays in TIME_WAIT.

ndd -set /dev/tcp tcp_time_wait_interval value
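
The connection count and the TIME_WAIT share can also be computed from saved netstat -a output. The Python sketch below mirrors the grep and wc -l pipeline; the function names, the host name, and the sample lines are invented for illustration, and real netstat output has more columns and header lines than shown here.

```python
def count_connections(netstat_output, hostname):
    """Count lines mentioning hostname, like `grep hostname | wc -l`."""
    return sum(1 for line in netstat_output.splitlines() if hostname in line)


def count_time_wait(netstat_output):
    """Count lines whose fields include the TIME_WAIT state."""
    return sum(
        1 for line in netstat_output.splitlines() if "TIME_WAIT" in line.split()
    )


# Hypothetical excerpt of saved `netstat -a` output.
NETSTAT_SAMPLE = """\
myhost.80        client1.40312    49640      0 49640      0 ESTABLISHED
myhost.80        client2.51877    49640      0 49640      0 TIME_WAIT
myhost.8080      client3.33010    49640      0 49640      0 ESTABLISHED
      *.22             *.*            0      0 49152      0 LISTEN
"""

connections = count_connections(NETSTAT_SAMPLE, "myhost")
waiting = count_time_wait(NETSTAT_SAMPLE)
print(connections, waiting, connections > 20000)
```

A large TIME_WAIT share suggests the tcp_time_wait_interval setting is the one to look at first.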

Veritas Storage Foundation 5.0 Administration for UNIX (250-250)
Answer each question then check the correct answers provided at the bottom of the page.
1. A Volume Manager disk can be divided into one or more ________.
a. disk groups
b. partitions
c. slices
d. subdisks
2. What does an active/passive array provide?
a. load balancing using minimum I/O policy
b. high availability in the event of a total array failure
c. load balancing using the round robin I/O policy
d. high availability in the event of a path failure
3. Which configuration step must be completed prior to assigning a new disk to a disk group?
a. initialize the disk
b. create subdisks
c. assign disk space to volumes
d. configure the volume disk pool
4. Which vxdisk command options display disk information and the disk group status?
a. -o dg list
b. -o alldgs list
c. -o list alldgs
d. -o list dg
5. If the datadg disk group has a disk group version of 50, what occurs when you run the vxdg upgrade datadg command with no other options?
a. The disk group is upgraded by one disk group version level, in this case, to version 60
b. You receive an error stating that the disk group version must be specified
c. The disk group is upgraded to the latest disk group version
d. The current disk group version is displayed and no further action is taken
6. Output from the vxprint command displays information stored in the _______.
a. private region
b. public region
c. partition table
d. dirty region log
7. Which command removes the datavol volume from the datadg disk group?
a. vxassist -g datadg remove volume datavol
b. vxremove -g datadg remove volume datavol
c. vxassist -g datadg destroy volume datavol
d. vxdg -g datadg destroy volume datavol
8. Which statement is true about the relationship between a Volume Manager volume and the corresponding file system?
a. Starting the volume will start the file system
b. The file system must be mounted to stop the volume
c. The file system must be unmounted to stop the volume
d. Starting the file system will start the volume
9. Which command forces the daemon to reread all the drives in the system?
a. kill -HUP vxiod
b. vxdisk rescan
c. vxdctl enable
d. vxprint -voldstart
10. Which command can be used to remove a disk interactively?
a. vxdiskadm option "remove a disk"
b. vxdisk remove -f -i
c. vxdisk relocate -f -i
d. vxdiskadm option "relocate subdisks"
11. The datadg disk group contains four disks. A 100 MB volume named datavol is concatenated using two disks. There are no other volumes in the disk group. There are three processes performing random reads that are 512K in size on the volume. Output from a vxstat command indicates that all I/O activity occurred mostly on one of the two disks. Which action will most evenly distribute the I/O across all disks in the disk group?
a. Mirror the volume using the two unused disks
b. Remove the two unused disks from the disk group
c. Resize the volume to use all of the disks
d. Stripe the volume across all four disks
12. Which layout options are available when using the vxassist command to create a layered volume? (Choose two)
a. concat-mirror
b. mirror-concatenate
c. stripe-mirror
d. mirror-stripe
e. concatenate-stripe
f. stripe-concatenate
13. Which menu option within the vxdiskadm utility can be used to create a new disk group?
a. Add or initialize one or more disks
b. Add or initialize one or more disk groups
c. Make a disk available for hot-relocation use
d. Enable access to (import) a disk group
14. What are the benefits of enclosure-based naming? (Choose three)
a. easier fault isolation
b. improved array availability
c. device-name independence
d. improved SAN management
e. improved disk performance
15. Which command creates a 10 GB volume named datavol in the datadg disk group, assuming that the /etc/default/vxassist file does NOT exist on the system?
a. vxvmvol -g datadg new 10g datavol
b. vxvol -g datadg create 10g datavol
c. vxassist -g datadg make datavol 10g
d. vxdisk -g datadg 10g newvol datavol
16. Online resizing of a Volume Manager volume and file system requires that the file system is _____.
a. in the bootdg disk group
b. checked before the process
c. shared across disk groups
d. mounted during the process
17. Which command displays the contents of the volboot file?
a. vxvolboot list
b. vxcat volboot
c. vxdctl list
d. vxconfig volboot
18. What are the characteristics of a space-optimized snapshot? (Choose two)
a. contains compressed primary data
b. references the primary data
c. requires less space than a full-sized point-in-time copy
d. initially contains a complete copy of primary data
e. performs an automatic atomic-copy resynchronization
19. Which Veritas Volume Manager command displays average volume read and write times?
a. vxprint
b. vxstat
c. vxtrace
d. vxinfo
20. What is the recommended next step to be performed after a failed disk has been physically replaced?
a. Logically replace the disk in Volume Manager
b. Unrelocate any relocated Volume Manager subdisks to the new disk
c. Synchronize any STALE plexes
d. Ensure that the operating system can access the disk
21. Which commands can be used to manage dynamic multipathing? (Choose two)
a. vxddladm
b. vxdiskadm
c. vxdmpadm
d. vxpathadm
e. vxassist
22. Which task related to protecting the Volume Manager configuration has the steps "precommit" and "commit" associated with it?
a. restore
b. backup
c. replace
d. remove
Answers: 1-d, 2-d, 3-a, 4-b, 5-c, 6-a, 7-a, 8-c, 9-c, 10-a, 11-d, 12-a&c, 13-a, 14-a&c&d, 15-c, 16-d, 17-c, 18-b&c, 19-b, 20-d, 21-b&c, 22-a