Monday, October 31, 2011

What's cooking for FreeBSD 7?

The next major release of FreeBSD, version 7, is one of the most significant so far, with the largest amount of new technology and improvements since the introduction of 5.0. Since constantly searching the mailing lists for important changes can be a bit tedious, I've created this (frequently updated) page to list some of the more interesting new things in one place.
FreeBSD 7.0 has been released! I've now started the continuation of this project: What's cooking for FreeBSD 8.
Also useful are the quarterly Status Reports.
If you're interested in how FreeBSD gets developed, you're encouraged to read the mailing lists and developer blogs.

Network stack improvements and cleanup

Even though this document mentions only a few people, the effort to improve the network stack and its performance has been carried by many.

New sendfile() implementation, improved sosend()

Status: Committed to -CURRENT
Will appear in 7.0: sure
Author: Andre Oppermann, Robert Watson
Homepage: message
While working on TSO support, Andre Oppermann found several ways to optimize the kernel's internal networking support. The new sendfile() implementation sends larger chunks of data at once and improves performance by up to 5x when used with TSO and other new enhancements. Improvements to sosend() and related functions reduced the CPU consumption of the sending side of network interfaces by almost a factor of three. Note that these are microbenchmarks, and performance improvements in real usage still need to be quantified.
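From userland, the interface behind all this is the sendfile() family of calls, which hands a file to the kernel for transmission instead of copying it through a user-space buffer. The following is only a minimal sketch of that usage in Python; the helper names and the 4096-byte payload are made up for illustration, and it says nothing about the kernel changes themselves.

```python
import os
import socket
import tempfile


def send_file_over_socket(sock: socket.socket, path: str) -> int:
    """Send the whole file at `path` over a connected stream socket."""
    with open(path, "rb") as f:
        # socket.sendfile() uses os.sendfile() when the platform offers it
        # and falls back to plain send() otherwise.
        return sock.sendfile(f)


def sendfile_demo(payload: bytes = b"x" * 4096):
    """Round-trip `payload` through a local socket pair (illustrative only)."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    left, right = socket.socketpair()
    try:
        sent = send_file_over_socket(left, path)
        left.shutdown(socket.SHUT_WR)  # signal end-of-stream to the reader
        chunks = []
        while True:
            chunk = right.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        left.close()
        right.close()
        os.unlink(path)
    return sent, b"".join(chunks)
```

The payload is kept small on purpose so that it fits in the socket buffer and the single-threaded demo cannot block.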

TSO and LRO support

Status: Committed or ready for -CURRENT
Will appear in 7.0: sure
Author: Andre Oppermann and Andrew Gallatin
The ongoing effort to improve FreeBSD's network performance (especially after the hit taken during transition to SMP) has resulted in the new ability to support TSO (TCP/IP segmentation offload) and LRO (Large Receive Offload) hardware on gigabit and faster cards. Some of the drivers that support TSO include: em, bc, cxgb, ixgbe, msk, mxge, nxge, nfe, re (or in plain words: Intel, Broadcom, NVidia, Realtek and other cards, gigabit or better). LRO support is currently in mxge.

TCP socket buffers auto-sizing

Status: Partially committed to -CURRENT
Will appear in 7.0: sure
Author: Andre Oppermann
FreeBSD has a default 32K send socket buffer. This supports a maximal transfer rate of only slightly more than 2Mbit/s on a 100ms RTT transcontinental link, or just above 1Mbit/s at 200ms. With TCP send buffer auto-scaling and its default settings, it supports 20Mbit/s at 100ms and 10Mbit/s at 200ms. Both send and receive buffers are auto-sized.
While support for send buffer auto-sizing is committed, patches for the receiving side are still under testing.
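The figures above follow from the rule that a sender can have at most one buffer's worth of data in flight per round trip. A small sketch of that arithmetic; the 256K ceiling is my assumption, picked because it reproduces the 20Mbit/s and 10Mbit/s numbers, and is not taken from the text.

```python
def max_throughput_mbit(buffer_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when one full buffer is sent per round trip."""
    return buffer_bytes * 8 / rtt_seconds / 1_000_000


DEFAULT_SEND_BUFFER = 32 * 1024    # the 32K default mentioned in the text
ASSUMED_AUTO_MAX = 256 * 1024      # assumption: ceiling that auto-sizing may grow to

for rtt in (0.100, 0.200):
    fixed = max_throughput_mbit(DEFAULT_SEND_BUFFER, rtt)
    scaled = max_throughput_mbit(ASSUMED_AUTO_MAX, rtt)
    print(f"RTT {rtt * 1000:.0f} ms: {fixed:.2f} Mbit/s fixed, {scaled:.2f} Mbit/s auto-sized")
```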

Rapid Spanning Tree Protocol (802.1w)

Status: Committed to -CURRENT
Will appear in 7.0: sure
Author: Andrew Thompson
RSTP provides faster spanning tree convergence. The protocol exchanges information with neighboring switches to quickly transition to forwarding without creating loops. The code defaults to RSTP mode but downgrades any port connected to a legacy STP network, so it is fully backward compatible.

SCTP (Stream Control Transmission Protocol)

Status: Committed to -CURRENT
Will appear in 7.0: sure
Authors: Randall Stewart, George Neville-Neil
FreeBSD is the reference implementation for SCTP.
Like TCP, SCTP provides a reliable transport service, ensuring that data is transported across the network without error and in sequence. Like TCP, SCTP is a session-oriented mechanism, meaning that a relationship is created between the endpoints of an SCTP association prior to data being transmitted, and this relationship is maintained until all data transmission has been successfully completed.
Unlike TCP, SCTP provides a number of functions that are critical for telephony signaling transport, and at the same time can potentially benefit other applications needing transport with additional performance and reliability.

Link aggregation / trunking

Status: Committed to -CURRENT
Will appear in 7.0: sure
Author: Reyk Floeter (from OpenBSD)
Manpage: lagg(4)
OpenBSD's trunk(4) was imported to FreeBSD in time to be shipped in FreeBSD 7.0. The trunk interface allows aggregation of multiple network interfaces as one virtual trunk interface for the purpose of providing fault-tolerance and high-speed links. The driver currently supports the trunk protocols failover (the default), fec, lacp, loadbalance, roundrobin, and none.

Improvements to kernel facilities

PMC performance monitoring

Status: Available in -CURRENT, partially available in RELENG_6
Will appear in 7.0: sure
Author: Joseph Koshy
This project implements a kernel module (hwpmc(4)), an application programming interface (pmc(3)) and a few simple applications (pmcstat(8) and pmccontrol(8)) for measuring system performance using event monitoring hardware in modern CPUs.
Some parts (hwpmc, libpmc, pmcstat) were developed even before RELENG_6 was branched, and new development goals for 7.x include support for callgraphs and a GUI front end.

Interrupt filtering

Status: Mostly committed to -CURRENT
Will appear in 7.0: sure
Author: Paolo Pisati
Homepage: wiki page
Interrupt filtering is a new method to handle interrupts in FreeBSD that retains backward compatibility with the previous models (FAST and ITHREAD), while improving over them in some aspects. With interrupt filtering, the interrupt handler is divided into 2 parts: the filter (that checks if the actual interrupt belongs to a device) and a private per-handler ithread (that is scheduled in case some blocking work has to be done). The main benefits of this work are:
  • Feedback from filters (the operating system finally knows the state of an event and can react accordingly).
  • Lower latency/overhead for shared interrupt lines.
  • Previous experiments with interrupt filtering showed an increase in performance against the plain ithread model in some cases.
  • General shrink of the machine-dependent code: part of the interrupt handling code was turned into machine-independent code.
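The filter/ithread split can be pictured with a toy model of one shared interrupt line. This is a hypothetical simulation, not kernel code: the class and function names are invented here, and the three result values are only loosely modeled on the idea of a filter reporting "not mine", "done", or "needs the thread".

```python
from enum import Enum


class FilterResult(Enum):
    STRAY = "stray"                      # interrupt did not come from this device
    HANDLED = "handled"                  # fully serviced inside the filter
    SCHEDULE_THREAD = "schedule_thread"  # blocking work is left for the ithread


class Handler:
    """One driver's handler on a shared interrupt line (toy model)."""

    def __init__(self, name, pending=False, needs_thread=False):
        self.name = name
        self.pending = pending            # did our device raise the interrupt?
        self.needs_thread = needs_thread  # does servicing it involve blocking work?
        self.thread_runs = 0

    def filter(self):
        # The filter only checks and acknowledges the device; it never blocks.
        if not self.pending:
            return FilterResult.STRAY
        self.pending = False
        if self.needs_thread:
            return FilterResult.SCHEDULE_THREAD
        return FilterResult.HANDLED

    def ithread(self):
        # Private per-handler thread body: the place for work that may block.
        self.thread_runs += 1


def dispatch(shared_line):
    """Run every filter on the line, then only the ithreads that were requested."""
    results = {h.name: h.filter() for h in shared_line}
    for h in shared_line:
        if results[h.name] is FilterResult.SCHEDULE_THREAD:
            h.ithread()
    return results
```

Because each filter reports back, the dispatcher knows which devices actually raised the interrupt and wakes only the threads that have work to do.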

Linuxulator for Linux 2.6

Status: Committed to -CURRENT
Will appear in 7.0: sure
Authors: Alexander Leidinger, Roman Divacky
Homepage: blog post, cvs commit note
FreeBSD includes support for natively executing Linux binaries. This is done via runtime translation of Linux syscalls to BSD syscalls, with no performance penalty. The facility is colloquially called the "linuxulator".
Linuxulator in -CURRENT has been updated to run binaries made for Linux 2.6.16 (though the default for 7.0 will still be 2.4), and the official Linux environment will be Fedora Core 5.

New scheduler: ULE 2.0 / 3.0

Status: Committed to -CURRENT
Will appear in 7.0: sure
Author: Jeff Roberson
Homepage: CVS file reference, commit message, description
The original SCHED_ULE was under-performing and buggy, so it got reworked. The new scheduler replaces, and has the same name as, SCHED_ULE, but is of a somewhat different architecture. It replaces the double queue mechanism with circular queues, and fixes a lot of other things, but it's still an O(1) scheduler with per-CPU queues.
During SCHED_ULE 2 development there was a brief period where there was a third (or fourth, depending on how you count) scheduler, named SCHED_SMP, forked from SCHED_ULE 2 and heavily optimized for configurations with a large number of CPUs (8+). This SCHED_SMP has been renamed and committed as SCHED_ULE. While the new scheduler will really shine on multi-CPU machines, it's now also recommended for single-processor systems as it has much better interactive performance (mixing of processes with different requirements for IO vs CPU time). ULE will not be enabled by default for 7.0 but it's an officially recommended performance optimization.

Improved accounting file format

Status: Committed to -CURRENT
Will appear in 7.0: sure
Author: Diomidis Spinellis
Manpage: acct(5)
The accounting record format has been revised to store time values with microsecond precision. This allows the recording of meaningful values for short-running commands on modern fast processors. The adoption of the IEEE 754 float format for storing time and usage values greatly increases their range and precision, and also simplifies the processing of accounting records by third party tools. The new record format and the tools lastcomm(1) and sa(8) maintain backwards compatibility with the original accounting format.
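To get a feel for what storing microsecond values in an IEEE 754 single-precision float means, here is a small round-trip sketch. It only exercises the 32-bit float encoding itself; it does not reproduce the acct(5) record layout, and the sample values are invented.

```python
import struct


def roundtrip_float32(value: float) -> float:
    """Encode `value` as a 32-bit IEEE 754 float and decode it again."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


short_command_usec = 1250.0           # a 1.25 ms command is stored exactly
one_hour_usec = 3_600_000_000.0       # a long run is stored with a small relative error

exact = roundtrip_float32(short_command_usec)
approx = roundtrip_float32(one_hour_usec)
relative_error = abs(approx - one_hour_usec) / one_hour_usec
```

Small values survive unchanged, while large ones keep roughly seven significant decimal digits, which is the trade of range against absolute precision that a float format makes.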

Storage subsystems' improvements


ZFS

Status: Committed to -CURRENT
Will appear in 7.0: sure
Author: Pawel Jakub Dawidek
Homepage: announcement message, commit announcement message
Sun's ZFS is in the process of being ported to FreeBSD, with the intention of offering most (or all) features found in the original implementation. It's integrated with FreeBSD's existing features like UFS and GEOM, thus offering the possibility of creating FreeBSD UFS file systems on ZFS volumes, and using GEOM providers to host ZFS file systems.
ZFS is an advanced file system (actually, a combination of file system and volume manager) with many interesting features built-in: snapshots, copy-on-write, dynamic striping and RAID5, up to 128-bit file system size (limited to 64 bits in practice even in Solaris - there's no 128-bit integer type in standard C language), and globally optimal I/O sorting and aggregation. It's marked EXPERIMENTAL in 7.0-RELEASE.
ZFS is still experimental on FreeBSD, and it's recommended that users get familiar with the FreeBSD ZFS documentation before using it. For a more light-hearted introduction see this presentation by Pawel.


TMPFS

Status: Committed to -CURRENT
Will appear in 7.0: sure
Authors: Julio M. Merino Vidal, Rohit Jalan, Howard Su, Glen Leeder
Homepage: TMPFS page on FreeBSD wiki, TMPFS at NetBSD
TMPFS is a memory file system designed to efficiently allocate (and deallocate) memory used for the file system itself, as contrasted to the "usual" way of creating memory file systems by creating memory storage devices ("RAM drives"). It's marked EXPERIMENTAL for 7.0-RELEASE.


gjournal

Status: Committed to -CURRENT
Will appear in 7.0: sure
Author: Pawel Jakub Dawidek
Homepage: message
Gjournal is a GEOM storage class that provides data journaling facilities to any providers (and consumers) the user needs. As a special case it has support in UFS file system code, and in this combination it makes UFS a journaled file system. In itself, gjournal consumes two devices (one for the data, one for the journal) and provides one. Since it takes special care to work well with disk drive hardware caches, it can be used to accelerate and provide reliability in many other uses, such as GELI and GBDE encrypted device providers.
I'm proud to say current gjournal is a continuation of my idea implemented for Google's Summer of Code 2005.


gvirstor

Status: Committed to -CURRENT
Will appear in 7.0: sure
Author: Ivan Voras
Gvirstor is a GEOM storage class that provides a storage device of arbitrary size in "overcommit" mode (i.e. larger than physically available storage). Providers can be added to the virstor device on-line (while used, e.g. mounted), and removed if unused and at the end of the list of components.
This work was created by me, with Pawel Jakub Dawidek as mentor and sponsored by Google in Summer of Code 2006.
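The "overcommit" idea can be illustrated with a toy model in which physical chunks are assigned only when a virtual chunk is first written. This is a simplified sketch under my own assumptions (chunk-granular mapping, first-free allocation, invented class and method names) and not the actual on-disk design.

```python
class VirstorModel:
    """Toy model of a virtual device larger than its physical backing store."""

    def __init__(self, virtual_chunks: int, physical_chunks: int):
        self.virtual_chunks = virtual_chunks
        self.total_physical = physical_chunks
        self.free_physical = list(range(physical_chunks))
        self.mapping = {}  # virtual chunk number -> physical chunk number

    def write(self, vchunk: int) -> int:
        """Return the physical chunk backing `vchunk`, allocating on first write."""
        if not 0 <= vchunk < self.virtual_chunks:
            raise IndexError("write beyond the end of the virtual device")
        if vchunk not in self.mapping:
            if not self.free_physical:
                raise RuntimeError("physical storage exhausted, add a component")
            self.mapping[vchunk] = self.free_physical.pop(0)
        return self.mapping[vchunk]

    def add_component(self, chunks: int) -> None:
        """Grow the physical pool while the device stays in use."""
        start = self.total_physical
        self.free_physical.extend(range(start, start + chunks))
        self.total_physical += chunks
```

In this model a device advertised as 1000 chunks can start with only two physical chunks; a third write to a new location fails until another component is added.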


gmultipath

Status: Committed to -CURRENT
Will appear in 7.0: sure
Author: Matt Jacob
Homepage: CVS message
Gmultipath allows failover between multiple devices that represent the same storage device. This is an active/passive{/passive...} arrangement that has no intrinsic internal knowledge of whether devices it is given are truly multipath devices. As such, this is a simplistic approach, but still a useful one. The first of N identical devices (and N *may* be 1!) becomes the active path until a BIO request is failed with EIO or ENXIO. When this occurs, the active disk is ripped away and the next in a list is picked to (retry and) continue with.
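The failover rule described above can be sketched as a short simulation. The Path and Multipath classes are hypothetical stand-ins for illustration, not the GEOM class itself; only the choice of EIO and ENXIO as the triggering errors comes from the text.

```python
import errno
import os


class Path:
    """One of N devices that are assumed to reach the same storage."""

    def __init__(self, name, fail_with=None):
        self.name = name
        self.fail_with = fail_with  # errno to raise on every request, or None

    def submit(self, request):
        if self.fail_with is not None:
            raise OSError(self.fail_with, os.strerror(self.fail_with))
        return f"{request} via {self.name}"


class Multipath:
    FAILOVER_ERRORS = {errno.EIO, errno.ENXIO}

    def __init__(self, paths):
        self.paths = list(paths)  # the first entry is the active path

    def submit(self, request):
        while self.paths:
            try:
                return self.paths[0].submit(request)
            except OSError as exc:
                if exc.errno not in self.FAILOVER_ERRORS:
                    raise
                # Drop the failed path; the next one becomes active and
                # the same request is retried on it.
                self.paths.pop(0)
        raise OSError(errno.ENXIO, "no usable paths left")
```

With a failing first path, the request completes on the second one and the failed path stays out of the list from then on.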

New platforms

New platform: ARM architecture

Status: Committed to -CURRENT, MFC-ed to RELENG_6
Will appear in 7.0: sure
Authors: Olivier Houchard, Warner Losh & more
Support for ARM embedded architecture has been under development since 6.0, enabling FreeBSD presence in the embedded markets.
The support is now MFC-ed to 6.x and is available in 6.2-RELEASE. It's still under development and will likely support more boards in the future.

New platform: sun4v (Niagara / T1)

Status: Committed to -CURRENT
Will appear in 7.0: probably
Authors: Kip Macy, John Birrell & more
Homepage: CVS announcement
There's still a long way to go before Sun's Niagara/sun4v platform is fully supported, but progress is slowly being made. Niagara offers advanced features such as eight cores and 32 threads per CPU, and hardware public key cryptography acceleration. Unfortunately, this architecture is not supported out-of-the-box in 7.0.

Security features

Security event auditing

Status: Committed to -CURRENT, MFC-ed to RELENG_6
Will appear in 7.0: sure
Authors: Robert Watson & more
Event auditing allows the reliable, fine-grained, and configurable logging of a variety of security-relevant system events, including logins, configuration changes, and file and network access. These log records can be invaluable for live system monitoring, intrusion detection, and postmortem analysis. FreeBSD implements Sun's published BSM API and file format, and is interoperable with both Sun's Solaris and Apple's Mac OS X audit implementations.
The audit framework was MFC-ed to RELENG_6 and is available in 6.2-RELEASE.

New privilege separation capabilities

Status: Committed to -CURRENT
Will appear in 7.0: sure
Author: Robert Watson
Homepage: list announcement
This is a framework which can be used together with MAC to create policies similar to RBAC (as seen in Solaris & others), which allow the root privilege to be separated into several fine-grained capabilities such as "can access the network" or "can bypass file system quotas". This is work in progress, and no shipped policy modules directly implement all of the functionality yet.

Multimedia features

Hi-def audio

Status: Mostly committed to -CURRENT
Will appear in 7.0: sure
Author: Ariff Abdullah
A new driver, snd_hda, has been developed to support professional sound equipment and new hardware on the market. HDA hardware is capable of delivering 192 kHz/32 bit quality for two channels and 96 kHz/32 bit for up to eight channels. Latency has been reduced in many cases.
Related to this, new drivers for envy24(ht) sound hardware have been committed to -CURRENT, and multichannel audio support is due to be finished soon.

Userland enhancements


jemalloc

Status: Committed to -CURRENT
Will appear in 7.0: sure
Author: Jason Evans
The currently used malloc() library, called phkmalloc since its creator is Poul-Henning Kamp, is almost a decade old in its present implementation. It was designed for a time when memory was scarce, the priorities considered in memory allocation were different, and multithreading was still an academic idea. Even so, it's one of the more popular malloc() implementations, used in all BSDs and even some historical Linux distributions.
Because of phkmalloc's inefficiency when used in multithreaded applications running on multiprocessor systems, a new userland memory allocator was created, named jemalloc after Jason Evans, its creator. The improvements in computer speed and memory availability mean that, compared to phkmalloc, which only needed to be conservative in memory usage, jemalloc needed to be more sophisticated and account for low-level properties such as CPU cache locality and parallel execution.
The result is an allocator which is optimized for multithreading, using multiple allocation arenas to help concurrency. On single processor systems there's only one arena, while on multi-processor or multi-core systems there are four times as many arenas as there are processors. Allocations are divided into broad classes based on their size and those classes are further subdivided. Benchmarks show that jemalloc does significantly better in multithreaded applications (like MySQL) and for applications that make many small allocations.
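The arena rule is simple enough to write down. In this sketch the arena count follows the description above, while the round-robin assignment of threads to arenas is my own simplifying assumption; the function names are invented.

```python
def arena_count(ncpus: int) -> int:
    """Number of arenas under the rule described above."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return 1 if ncpus == 1 else 4 * ncpus


def arena_for_thread(thread_index: int, ncpus: int) -> int:
    # Assumption: threads are spread over arenas round-robin, so that
    # concurrent allocations rarely contend for the same arena.
    return thread_index % arena_count(ncpus)
```

On a two-CPU machine this gives eight arenas, so eight threads can each allocate from their own arena without sharing.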

Bits & pieces

Authors: many
Here are some additional changes for 7.0 that are not so glamorous or are smaller in scope:
  • Lots of performance improvements on SMP machines (see MySQL read performance, MySQL write performance and BIND performance graphs).
  • Significantly increased scalability on SMP machines, mainly from extraordinary work done by David Xu (the libthr threading library), Jeff Roberson (scheduler, flock locking), Attilio Rao (improved kernel locking performance) and Robert Watson (file descriptor locking, unix sockets locking and more).
  • Significantly increased network scalability, resulting mostly from the switch to direct dispatch of the network stack from netisr. This is especially helpful for 10 Gbit/s NICs and was mainly done by Robert Watson and Kip Macy.
  • The Giant lock has been pushed further back, and almost all kernel subsystems are now finely locked (e.g. VM, VFS, Net). Some of the recent improvements are the locking of the CAM subsystem and many SCSI drivers (by Scott Long), and similar locking work on the NFS client and the Firewire implementation.
  • iSCSI initiator (iSCSI target is available in ports)
  • SATA support
  • Read-only access to XFS file systems
  • Added support for MSI/MSI-X extensions to PCI
  • Support for Apple (Mac) hardware is being worked on
  • pf firewall updated to 4.1
  • X.Org 7.2 - things like beryl now work if you have the right drivers
  • gcc 4.2
  • Implemented symbol versioning for many base OS libraries
  • libthr becomes the default threading library

Things that didn't make it

Despite plans and best efforts, some things won't make it into FreeBSD 7.0-RELEASE. These are:
  • SCHED_CORE - Doesn't perform as well as SCHED_ULE2
  • DTrace - MFCed into 7.2.
  • Superpages - MFCed into 7.2.

Of course, this much new technology will need much testing before it's ready for use. You can help by installing a snapshot of -CURRENT and running it on as close to your regular load as possible. Disable debugging features (which are enabled by default during development) before benchmarking.

