Apostrophe 1.5: legacy documentation

Still using Apostrophe 1.5? Check out Apostrophe 2 for your future projects.

Scaling Apostrophe

Do I Need This Guide?

Probably not. We have sites receiving over 500,000 page accesses per month that do not implement the techniques here. You can go a long, long way by choosing an appropriately sized VPS or dedicated server and then, if necessary, applying our performance tips. In fact, if you combine that setup with a service like http://cloudflare.com, you may be able to handle the "Oprah effect" of national press coverage without any further work. So don't even think about load balancing or S3 unless you know for a fact that your storage requirements or traffic levels are very high indeed AND you have already exhausted those simpler options.

Of course, the performance tips document only covers scaling for traffic purposes. Scaling up media storage is a separate issue. Most clients won't have more than a few GB of photos and office documents, but yours might have a requirement to store 300GB of high-resolution originals on the site. You might find yourself following this document simply to expand your storage by using Amazon S3 as an alternative to the filesystem.

A third case: you might choose to pursue load balancing purely for the sake of high availability. RAID storage and frequent backups provide high reliability, but if a VPS goes down it can take some time to restore your configuration and data from backups. Depending on the needs of your project this may be sufficient reason to pursue load balancing even if you do not otherwise need extra capacity, since load balancing makes it possible to "keep right on truckin'" when one server goes down.

Before You Start

  • Your apostrophePlugin must be up to date with the 1.5 stable svn branch. Much of our work on scalability is ongoing. If you are not tracking the 1.5 stable branch, preferably by using svn externals, please don't ask us why things don't work.
  • After you update your apostrophePlugin, run the apostrophe:migrate task (shown below) before continuing. This ensures you have the tables needed for the MySQL-backed cache recommended here. Make sure you run it for each server environment that has a distinct database.
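
For example (the --env option is the standard symfony way to select each environment's configuration; shown here on the assumption that apostrophe:migrate honors it like other symfony tasks):

./symfony apostrophe:migrate
./symfony apostrophe:migrate --env=prod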

Scaling It Yourself

If CloudFlare meets your needs, you're done. If not, read on.

In a nutshell, the issue with scaling up a website is that you eventually overwhelm the single server that is operating it.

Once you have exhausted reasonable steps to upgrade that single webserver (a larger VPS, a dedicated server, and the application of the sensible steps spelled out in performance tips) you will discover a need to break up the components of your website and distribute them over several servers.

The most obvious step is to put the MySQL server software and the Apache webserver software on separate servers. This is easy to do: just edit your config/databases.yml accordingly. Make sure the two servers are very close on the network; for acceptable performance they must be able to communicate over your web hosting company's local network, not the open Internet. Otherwise the round trips to fetch information from the database server, make decisions, and fetch more information will hurt performance more than dividing the servers helps.
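
For example, config/databases.yml might point at the database server's address on the private network (the hostname, credentials, and IP below are placeholders):

all:
  doctrine:
    class: sfDoctrineDatabase
    param:
      dsn: 'mysql:host=10.0.0.5;dbname=myproject'
      username: myproject
      password: xxxx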

But while this step is worthwhile, you will quickly notice that it doesn't help all that much by itself. The reason is that a CMS like Apostrophe spends most of its time executing PHP code and very little time processing actual SQL queries on the MySQL server.

The solution is to split up the workload of the webserver among several webservers in a load-balanced configuration.

Sandbox Project

You can check out a branch of the Apostrophe 1.5 sandbox that is configured for S3. It may help resolve questions about your settings; we also refer to this project when verifying problem reports regarding this document:

svn co http://svn.apostrophenow.org/sandboxes/asandbox/branches/1.5-with-s3 demo15-s3

Load Balancing and Apostrophe

There are many good sources on the subject of load balancing in general. This documentation is specifically about Apostrophe, so I won't be covering the basics of how to set up several VPSes or dedicated servers and run load balancing software on a "front end" server that splits the incoming HTTP requests among them. However, check out Amazon's Elastic Load Balancing and Elastic Compute Cloud options, among many others.

Here I'll focus on the specific issues that may make it challenging to load-balance Apostrophe. Without proper attention to all of these issues, you may appear to be initially successful but could wind up with confused user sessions, damaged page trees (a very bad thing) or missing media on your pages:

  • Sessions are often stored in files on a non-shared filesystem
  • Caches are often stored in files on a non-shared filesystem
  • Search engine indexes are often stored in files on a non-shared filesystem
  • Locks that preserve the page tree's integrity may be applied to files on a non-shared filesystem
  • Media items (such as photos) are often stored in files on a non-shared filesystem

I'll look at all of these issues and explore different solutions. But first let's look at a shortcut that may work well for many use cases.

A Shortcut: Traditional Network File Systems

Traditional network file systems attempt to provide exactly the same features as the filesystem on your original single server, including the ability to lock files and guarantees of consistency between requests that arrive at different times. To the extent that these guarantees actually hold while providing acceptable performance, network file systems may be an acceptable solution for you.

This approach can work well provided that:

  • The network file system is robust enough
  • You are willing and able to spend time configuring and managing network file systems
  • You need to handle high traffic but are less concerned about storing huge numbers of media files (your site is popular, but not enormous)

You could choose to mount your entire project folder over the network filesystem, which may simplify deployment. If not, the two folders that absolutely must be shared via the network filesystem are data/a_writable and web/uploads.
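
For example, the mounts might look like this (shown with NFS, discussed below; the server name, export path, and project path are placeholders):

mount -t nfs fileserver:/export/apostrophe/a_writable /var/www/myproject/data/a_writable
mount -t nfs fileserver:/export/apostrophe/uploads /var/www/myproject/web/uploads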

There are two common network file systems in the Linux world at the moment: NFS and GlusterFS. NFS is the older solution and we do not use it in-house. Not all versions of NFS correctly support the flock() call that PHP exposes, which Apostrophe uses to lock files, prevent conflicts, and avoid the loss of the entire page tree (a very bad thing, in case you had not guessed).
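
To see what is at stake, here is the general flock() pattern (a simplified sketch, not Apostrophe's actual locking code; the lock file path is hypothetical):

// Acquire an exclusive lock before modifying shared state such as the page tree.
// On a network filesystem with broken flock() support, two webservers can both
// "acquire" the lock at once and corrupt the shared data.
$fp = fopen('/shared/data/a_writable/locks/tree.lock', 'c');
if ($fp !== false && flock($fp, LOCK_EX))
{
  // ... read and modify the page tree safely ...
  flock($fp, LOCK_UN);
  fclose($fp);
}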

The GlusterFS documentation claims there is support for flock() calls.

It is possible to configure Apostrophe to use  pkLockServer, a pure PHP, socket-based distributed lockserver. A suitable class is provided with Apostrophe and can be enabled in app.yml:

all:
  a:
    lock_class: aLockPkLockServer
    lock_options:
      host: localhost
      port: 20934

You must have an instance of pkLockServer running to connect to, of course. See the pkLockServer documentation for more information.

We formerly offered a lockserver option based on MySQL's GET_LOCK function; however, this proved not to work well for PHP web requests because PDO appears to pool connections among many requests. In any case, GET_LOCK kept handing out the same lock to multiple web requests simultaneously, which is not good! pkLockServer is a better option as long as a reliable backbone network connects all of the frontend servers to the pkLockServer instance.

Thanks to recent refactoring you are not limited to the flock() and pkLockServer implementations. You can write your own implementation of the new aLock interface and activate it using app_a_lock_class and app_a_lock_options as shown above.

(Almost) Everything in the Database

We prefer to store everything other than large files in the database. Databases are really great at storing data, locking things and coping with requests from many different frontend servers. If it's under a megabyte in size, it belongs in a database.

Here's how we accomplish this for Apostrophe.

Caches in the Database

Depending on the type of cache, you could arguably keep these in the filesystem, since having one copy per frontend server is "only a little redundant": eventually each frontend server warms up its own copy. But if you're using the cache to store sessions, or to keep an S3 cache consistent as described later, that won't do at all. So we recommend moving your Apostrophe caches to the database by configuring app.yml appropriately:

all:
  a:
    cache_default_class: aMysqlCache

For better performance you can use aMongoDBCache, which has the same behavior as aMysqlCache but is roughly 10x faster. Of course, you must have the mongodb pecl extension, and MongoDB itself, installed to use it:

all:
  a:
    cache_default_class: aMongoDBCache
    cache_default_options:
      # change this to a shortname for *your* site
      database: shortsitename
      collection: cache
      uri: mongodb://some/shared/mongodb/installation/somewhere

By default, all of Apostrophe's caches are implemented as subclasses of sfFileCache with various names in data/a_writable. This is a bad idea with load balancing. By specifying a different cache class we avoid this issue. The aMysqlCache class uses the main database of the site, which of course is already shared. The aMongoDBCache class can be configured to point to a shared installation (if you skip the uri it talks to localhost, port 27017).

Yes, you can use other subclasses of sfCache here. However, they must support the prefix option properly, and be aware that prefix is the only option they will receive. You can address that by subclassing whatever cache class you prefer so that it fetches the rest of its options from app.yml (see the sketch below). You can also explicitly configure each individual cache, but you almost certainly don't need to.
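
Here is a minimal sketch of that approach using symfony's stock sfMemcacheCache (the subclass name and the app_my_cache_options key are hypothetical):

class myMemcacheCache extends sfMemcacheCache
{
  public function initialize($options = array())
  {
    // Apostrophe passes only 'prefix'; merge in the rest from app.yml
    $options = array_merge(sfConfig::get('app_my_cache_options', array()), $options);
    parent::initialize($options);
  }
}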

Sessions in the Database

There are many ways to share user session information. You could use PHP's memcached support, in which case Symfony won't know anything has changed. Or you could use PHP's various alternative session storage classes. Or you could configure your load balancer to provide "session affinity," in which traffic for a given user is always mapped to the same server.

There's nothing wrong with memcached per se, but it may go down and is one more thing to administer. Session affinity works, but it means that if a particular webserver goes down, any sessions on that server are lost. Our preferred solution is to use sfCacheSessionStorage, a standard Symfony offering that stores sessions in the cache. By combining this with aMysqlCache as discussed earlier we get robust sessions that live in the main database seen consistently by all webservers.

Just do this in factories.yml:

all:
  storage:
    class: sfCacheSessionStorage
    param:
      session_name: sfproject #[required] name of session to use
      session_cookie_path: / #[required] cookie path
      # session_cookie_domain: # cookie domain, should start with . if it is an entire domain and not a single site
      session_cookie_lifetime: +30 days #[required] lifetime of cookie
      session_cookie_secure: false #[required] send only if secure connection
      session_cookie_http_only: true #[required] accessible only via http protocol
      cache: 
        class: aMysqlCache
        param:
          prefix: session

Note the various options to specify the domain, name and lifetime of the session cookie. This is separate from PHP's built-in session handling.

Cloud-Friendly Image Scaling

Apostrophe normally uses netpbm rather than gd in some situations to use less memory when scaling images. However, netpbm is not particularly cloud-friendly: it would require copying files to the local hard drive first, which is undesirable. Rather than implement support for that, we've chosen to mandate the use of gd when using the cloud for storage. You can specify this in app.yml:

all:
  aimageconverter:
    # For the cloud, force the use of gd; piping files to disk and then
    # through netpbm is not really a win anymore
    netpbm: false

Search Engine Indexes in the Database

By default, sites built with Apostrophe before recent updates to our sandbox use the Zend Lucene search engine, which stores its indexes in data/a_writable/search_indexes. That won't work unless that folder is shared, and for performance reasons we strongly recommend moving away from Zend Lucene anyway.

To address this issue, migrate your project's search feature to apostropheMysqlSearchPlugin. Complete documentation is available on that page.

Locks in the Database

DEPRECATED: we formerly supported using MySQL's GET_LOCK function for locks; however, in our tests MySQL treated multiple PHP web requests as a single client, which makes it ineffective. If you specified the old app_a_lock_type option you will now get flock() instead, unless you configure pkLockServer as described above or provide your own implementation of aLock.

Media and Other Big Files in S3

This is the big one. If your media storage needs are easy to accommodate you might be OK at this point as long as you use a shared filesystem such as NFS or GlusterFS. But S3 is a big win both in terms of scalability and in terms of performance for typical end users viewing pages and media because it offloads all of the traffic that doesn't truly require PHP's attention.

Before You Begin

Go read about Amazon Simple Storage Service (S3). It's important that you understand the basics. I'll wait.

You're back! Good. Now create an Amazon Web Services account for your project and obtain your "access key id" (aka "key") and your "secret access key" (aka "secret key") from the "security credentials" page of the AWS site.

Now you're able to create and manage buckets on S3 in which to store your files. You will need two of them: one for your public files (which formerly lived in web/uploads) and another for your private files (which formerly lived in data/a_writable).

Decide what S3 region your content will live in. You should use a region that supports "read after write" consistency, which ensures that if a file has been written to by one webserver, another webserver will immediately be able to read from it. Otherwise media will behave unpredictably. The default "US East" region, which is misnamed and is really a general US region, DOES NOT provide read after write consistency. As of this writing, all other S3 regions do provide read after write consistency. If you don't know what S3 regions are available please read Amazon's AWS documentation.

Finally, create those two buckets: one for the public files that are web accessible, the other for the private files that are not. Take note of the bucket names; you'll need them soon.
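
You can create the buckets from the AWS console, or with a command line tool such as s3cmd (bucket names are placeholders; --bucket-location selects the region you chose above):

s3cmd mb s3://mypublicbucketname --bucket-location=us-west-1
s3cmd mb s3://myprivatebucketname --bucket-location=us-west-1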

Installing aS3StreamWrapper: the Amazon S3 Stream Wrapper

Out of the box, PHP doesn't know how to talk to S3. Fortunately PHP supports "stream wrappers," which teach PHP's standard functions like fopen and file_put_contents how to talk to different types of storage.

We have created aS3StreamWrapper, an Amazon S3 stream wrapper for PHP. There are other stream wrappers, but they support only a single, flat level of files per bucket; ours supports subdirectories. It also provides a cache for much quicker answers to questions like "does this file exist?" and "what is in the first 8K of the file?", which avoids significant performance overhead when working with S3, especially from outside Amazon EC2. And it comes with an extensive suite of unit tests. Although it is packaged for use as a Symfony plugin, the wrapper itself has no dependency on Symfony at all, and we welcome its use in other PHP-based software.
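
Once the wrapper is registered (as described below), ordinary PHP file functions simply work with S3 paths. A quick sketch (the bucket name is a placeholder):

// Standard PHP file functions, talking to S3 through the stream wrapper
file_put_contents('s3public://mypublicbucketname/test/hello.txt', 'Hello, S3!');
echo file_get_contents('s3public://mypublicbucketname/test/hello.txt');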

Our stream wrapper has been released on github, where a wider audience of PHP developers can contribute and improve it; however, thanks to github's svn support you can easily bring it into your project via svn externals.

To install the plugin, check it out to the aS3StreamWrapperPlugin subdirectory of your plugins directory. We strongly recommend doing this by editing your svn externals if your project is already in svn.

Here is the svn path to the plugin:

http://svn.github.com/punkave/aS3StreamWrapper.git

So to add it to your svn:externals use:

cd plugins
svn propedit svn:externals .

And add this line:

aS3StreamWrapperPlugin http://svn.github.com/punkave/aS3StreamWrapper.git

To enable the plugin, add it to the list of plugins for your project in config/ProjectConfiguration.class.php, and add a configureDoctrineConnectionDoctrine method to your ProjectConfiguration class if you do not already have one. That method is called by Doctrine right after the database connection becomes available, which ensures the cache is ready to talk to.
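
Enabling the plugin looks something like this sketch (your project's plugin list will differ):

class ProjectConfiguration extends sfProjectConfiguration
{
  public function setup()
  {
    $this->enablePlugins(array(
      'sfDoctrinePlugin',
      'apostrophePlugin',
      'aS3StreamWrapperPlugin',
      // ... your other plugins ...
    ));
  }
}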

In that method you'll need the following code to register the stream wrapper for two protocols, s3private and s3public, referring to your private and public files respectively:

  // Known Symfony issue: this method is sometimes called more than once
  static protected $first = true;

  public function configureDoctrineConnectionDoctrine($conn)
  {
    // sfAutoloadAgain tries hard to arrange things so it still works if it is not
    // already the last autoloader, but by modifying the autoloader chain while it is
    // being looped over it causes segfaults. Work around that by unregistering and
    // re-registering sfAutoloadAgain
    if (sfConfig::get('sf_debug'))
    {
      sfAutoloadAgain::getInstance()->unregister();
    }

    if (!self::$first)
    {
      return;
    }
    self::$first = false;

    // Register the wrapper twice: once for private files, once for public files
    $wrapper = new aS3StreamWrapper();
    $wrapper->register(array(
      'protocol' => 's3private',
      'key' => sfConfig::get('app_s3_key'),
      'secretKey' => sfConfig::get('app_s3_secret_key'),
      'acl' => AmazonS3::ACL_PRIVATE,
      'cache' => aCacheTools::get('s3privateCache'),
      'region' => sfConfig::get('app_s3_region')));
    $wrapper->register(array(
      'protocol' => 's3public',
      'key' => sfConfig::get('app_s3_key'),
      'secretKey' => sfConfig::get('app_s3_secret_key'),
      'acl' => AmazonS3::ACL_PUBLIC,
      'cache' => aCacheTools::get('s3publicCache'),
      'region' => sfConfig::get('app_s3_region')));

    if (sfConfig::get('sf_debug'))
    {
      sfAutoloadAgain::getInstance()->register();
    }
  }

Configuring Apostrophe for S3

Now we can edit app.yml to specify our AWS credentials and the Amazon region we'd like to use. To do that we specify app_s3_key, app_s3_secret_key, and app_s3_region. (These are not under app_a because they might be used for non-Apostrophe purposes.)

We must also specify the path where all asset files (such as CSS and JS files) will be kept in S3: the app_a_static_dir setting. The apostrophe:sync-static-files task uses this path to copy asset files from your local system to S3 for deployment, and when it is set it takes the place of the web root folder when files are stored. We'll also specify the URL that points to the same place from the browser's point of view: the app_a_static_url setting.

We'll also specify the path where private large files (such as temporary storage of uploaded files) are kept: the app_aToolkit_writable_dir setting. The exception is temporary files not needed beyond a single HTTP request; those can, and for performance should, live in the local filesystem. This is the app_aToolkit_writable_tmp_dir setting.

We also need to specify where uploaded media should be stored in S3. We do this with app_aToolkit_upload_dir. The URL corresponding to this is currently always /uploads, appended to app_a_static_url if set, otherwise to the relative URL root in the usual Symfony way. You'll also want to specify sf_upload_dir to match app_aToolkit_upload_dir for compatibility with non-Apostrophe code that needs to know where to store uploads.

(The distinction between app_a and app_aToolkit is present for historical reasons. Sorry for the annoyance.)

Finally, we also turn off app_a_copy_assets_then_rename, a safety provision that is unnecessary with S3 and doubles the time needed to copy compiled LESS files to S3.

All together it looks like this in app.yml. In my examples below I've used the all: key; however, you can and should specify separate settings for dev: and prod: so that you don't overwrite your production media when testing in a development environment. I recommend keeping them well apart in separate S3 buckets.

all:
  s3_key: 'MYKEYFROMS3'
  s3_secret_key: 'MYSECRETKEYFROMS3'
  s3_region: 'us-west-1'
  a:
    static_dir: s3public://mypublicbucketname
    static_url: http://mypublicbucketname.s3.amazonaws.com
    copy_assets_then_rename: false
  aToolkit:
    upload_dir: s3public://mypublicbucketname/uploads
    writable_dir: s3private://myprivatebucketname
    writable_tmp_dir: %SF_DATA_DIR%/a_writable/tmp

And in settings.yml, just to make sure non-Apostrophe Symfony code knows where to store uploads (note that plain settings live under the .settings key):

all:
  .settings:
    upload_dir: s3public://mypublicbucketname/uploads

Be sure to symfony cc at this point.

You're nearly ready to go! The next step is to push your asset files up to S3. If you have created your buckets properly and specified your options correctly, this command will copy asset files from your local web folder and its subdirectories up to your public S3 bucket. This can take a long time: S3 is highly optimized for read access but not especially fast for writes, especially if you are not running on EC2.

Sync the asset files with this command:

./symfony apostrophe:sync-static-files

Since this is your first push, if you have existing media files on your site, you can push those up as well (ONLY ON THE FIRST SYNC, NEVER LATER):

./symfony apostrophe:sync-static-files --sync-uploads

Again: you DO NOT want to sync uploads ever again after you first move a site to S3. If you do, you will wind up removing all the new media that have been added to the site since you moved to S3! This option is purely for migrating an existing site's media content to S3.

However, it is safe to run the task again without the --sync-uploads option to push up new assets that have appeared in your web folder. Do this as needed when your project's assets change or new asset files are added to Apostrophe. Behind the scenes the operation is similar to rsync, but keep in mind that it is not especially fast.

Verifying Your S3 Migration

Phew! You've completed the migration to S3... or have you? To verify the result, symfony cc and visit your site.

If assets such as CSS files are not working properly, check your app.yml settings. If media uploads don't work properly, check the app.yml and settings.yml settings described above. Also make sure you completed the migration to aMysqlCache.

Now "View Source" and make sure that CSS and JS files are loading from your Amazon S3 bucket's public web URL, not from your local webserver.

src attributes pointing to media files will point to your webserver on the first page load. This is because the site has not yet had a chance to generate scaled versions of those files in S3. When your browser requests such a local URL, the scaled version is generated and pushed to S3, and the browser is redirected to the image on S3 (just for that image, not the whole page). So if you refresh the page later you'll see that image URLs now point directly to S3.

Load Balancing: Conclusion

You can now scale your Apostrophe site much further than before. Since Apostrophe spends more time executing PHP code than waiting for the database, at this point you can handle considerably more traffic just by adding more servers to your load balancer. And the use of S3 makes your potential media storage almost unlimited. Just don't forget to pay your AWS bills, and be sure to monitor your usage.

Further Horizons: Scaling the Database

The CPU time spent running PHP code is the first bottleneck you encounter when scaling up an Apostrophe site. It's not much of a bottleneck - again, serving 500,000 pageviews a month from a single server is pretty darn good - but eventually you do hit it, and it makes sense to scale by load balancing multiple webservers.

Eventually, if you continue down this road with an extremely popular site, you may reach a point where the database becomes the breaking point.

It is tempting to use MySQL's replication feature to address this. Here one relieves the stress on the system through a different kind of load balancing: master-slave replication. This allows all read requests (SELECT queries) to be farmed out to one or more "slave" SQL servers while all write requests (INSERT, UPDATE and DELETE queries) are handled by the master. This greatly reduces the load on the master and also provides a pool of "hot spares" that can be swapped in for the master in the event of a crash.

Unfortunately, Doctrine 1.2 and Apostrophe are not especially well set up to pick the right master or slave connection for a given query on their own. Apostrophe in particular tends to use the default connection heavily. But even if this were not the case, there would be no guarantee that the server you read from had "caught up" with the server you just wrote to, which can lead to dangerous inconsistencies and the loss of content.

A simpler solution, and likely the real-world answer for many, is Amazon Relational Database Service (RDS), a MySQL implementation in the Amazon Web Services cloud which can be scaled up a long way without the need for separate connections to separate databases.

One can still use replication to provide read-only slaves for backup purposes (running mysqldump) and as hot spares to become the master quickly in the event of a system failure. We just can't recommend pointing your webservers at slaves at this time due to the issue of consistency.

Also, some of the services making up your site can be broken out to separate databases. If your traffic is truly "through the roof," one step you can certainly take to ease the load on your database server is to move user sessions out of it. Yes, putting them in the database with sfCacheSessionStorage is a neat trick, but feel free to go back to the default session storage as far as Symfony is concerned and configure PHP to use its built-in memcached session support (see the sketch below). Similarly, you can use a different cache class (via app_a_cache_default_class) built on memcached, or just on a separate MySQL server, reducing demand on the main MySQL server.
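
A minimal sketch of the relevant PHP settings, assuming the memcached pecl extension (in practice these usually belong in php.ini, and they must take effect before the session starts; the hostname is a placeholder):

// Store PHP sessions in memcached instead of the database
ini_set('session.save_handler', 'memcached');
ini_set('session.save_path', 'sessioncache.internal:11211');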

Note that while sessions are important, they are not as critical as page content, so you can implement them with low-overhead technologies like memcached. Note that Amazon now offers a cloud-hosted, memcached-compatible service.