It's been over two years since the last major release of ekg. Ever since the first release I knew there were a number of features I wanted ekg to have but didn't implement back then. This release adds most of them.
Integration with other monitoring systems
When I first wrote ekg I knew it only solved half of the program monitoring problem. Good monitoring requires two things:

1. a way to track what your program is doing, and
2. a way to gather and persist that data in a central location.
The latter is necessary because
- you don't want to lose your data if your program crashes (i.e. ekg only stores metrics in memory),
- you want to get an aggregate picture of your whole system over time, and
- you want to define alarms that go off if some metric passes some threshold.
Ekg has always done (1), as it provides a way to define metrics and inspect their values e.g. using your web browser or curl.
Ekg could help you to do (2), as you could use the JSON API to sample metrics and then push them to an existing monitoring solution, such as Graphite or Ganglia. However, it was never really convenient.
With this release, (2) gets much easier.
Statsd integration
Statsd is
A network daemon that ... listens for statistics, like counters and timers, sent over UDP and sends aggregates to one or more pluggable backend services (e.g., Graphite).
Statsd is quite popular and has both client and server implementations in multiple languages. It supports quite a few backends, such as Graphite, Ganglia, and a number of hosted monitoring services. It's also quite easy to install and configure (although many of the backends it supports are not).
Ekg can now be integrated with statsd, using the ekg-statsd package. With a few lines you can have your metrics sent to a statsd:
import System.Metrics (newStore, registerGcMetrics)
import System.Remote.Monitoring.Statsd (defaultStatsdOptions, forkStatsd)

main = do
    store <- newStore
    -- Register some metrics with the metric store:
    registerGcMetrics store
    -- Periodically flush metrics to statsd:
    forkStatsd defaultStatsdOptions store
ekg-statsd can be used either together with ekg, if you also want the web interface, or standalone, if the dependencies pulled in by ekg are too heavyweight for your application or if you don't care about the web interface. ekg has been extended so that it can share the Server's metric store with other parts of the application:
import System.Remote.Monitoring (forkServer, serverMetricStore)
import System.Remote.Monitoring.Statsd (defaultStatsdOptions, forkStatsd)

main = do
    handle <- forkServer "localhost" 8000
    forkStatsd defaultStatsdOptions (serverMetricStore handle)
Once you have set up statsd and a backend such as Graphite, the lines above are enough to make your metrics show up in Graphite.
Integration with your monitoring systems
The ekg APIs have been re-organized and the package split such that it's much easier to write your own package to integrate with the monitoring system of your choice. The core API for tracking metrics has been split out from the ekg package into a new ekg-core package. Using this package, the ekg-statsd implementation could be written in a mere 121 lines.
While integrating with other systems was technically possible in the past, using the ekg JSON API, it was both inconvenient and wasted CPU cycles generating and parsing JSON. Now you can get an in-memory representation of the metrics at a given point in time using the System.Metrics.sampleAll function:
-- | Sample all metrics. Sampling is /not/ atomic in the sense that
-- some metrics might have been mutated before they're sampled but
-- after some other metrics have already been sampled.
sampleAll :: Store -> IO Sample
-- | A sample of some metrics.
type Sample = HashMap Text Value
-- | The value of a sampled metric.
data Value = Counter !Int64
           | Gauge !Int64
           | Label !Text
           | Distribution !Stats
All that ekg-statsd does is call sampleAll periodically and convert the returned Values to UDP packets that it sends to statsd.
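The conversion itself is straightforward. Here is a minimal sketch, not ekg-statsd's actual code: it uses a simplified stand-in for ekg-core's Value type and one plausible mapping to statsd's "name:value|type" wire format:

```haskell
import Data.Int (Int64)

-- Simplified stand-in for ekg-core's Value type (Distribution omitted).
data Value = Counter !Int64
           | Gauge !Int64
           | Label !String

-- One plausible mapping to statsd's "<name>:<value>|<type>" wire format.
toStatsdLine :: String -> Value -> Maybe String
toStatsdLine name (Counter n) = Just (name ++ ":" ++ show n ++ "|c")
toStatsdLine name (Gauge n)   = Just (name ++ ":" ++ show n ++ "|g")
toStatsdLine _    (Label _)   = Nothing  -- statsd has no label type
```

A flush loop then formats every entry of the Sample this way and writes the resulting lines to a UDP socket.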
Namespaced metrics
In a large system each component may want to contribute their own metrics to the set of metrics exposed by the program. For example, the Snap web server might want to track the number of requests served, the latency for each request, the number of requests that caused an internal server error, etc. To allow several components to register their own metrics without name clashes, ekg now supports namespaces.
Namespaces also make it easier to navigate metrics in UIs. For example, Graphite gives you tree-like navigation of metrics based on their namespaces.
In ekg, dots in metric names are now interpreted as namespace separators. For example, the default GC metric names now all start with "rts.gc.". Snap could, for example, prefix all its metric names with "snap.". While this doesn't make collisions impossible, it should make them much less likely.
If your library wants to provide a set of metrics for the application, it should provide a function that looks like this:
registerFooMetrics :: Store -> IO ()
The function should call the various register functions in System.Metrics. It should also document which metrics it registers. See System.Metrics.registerGcMetrics for an example.
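For instance, a hypothetical library foo (both the library and the metric names below are made up for illustration) might expose something like this, registering everything under the "foo." namespace. A real library would keep the returned handles around so it can update the metrics later:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import System.Metrics (Store, createCounter, createDistribution)

-- | Register foo's metrics with the given metric store.
--
-- Registered metrics:
--
--   [@foo.requests@] Number of requests served (counter).
--   [@foo.request_latency_ms@] Request latency in ms (distribution).
registerFooMetrics :: Store -> IO ()
registerFooMetrics store = do
    _requests <- createCounter "foo.requests" store
    _latency  <- createDistribution "foo.request_latency_ms" store
    return ()
```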
A new metric type for tracking distributions
It's often desirable to track the distribution of some event. For example, you might want to track the distribution of response times for your webapp, so you can get notified if things are slow all of a sudden and so you can try to optimize the latency.
The new Distribution metric lets you do that. Every time an event occurs, simply call the add function:
add :: Distribution -> Double -> IO ()
The add function takes a value which could represent e.g. the number of milliseconds it took to serve a request.
When the distribution metric is later sampled, you're given a value that summarizes the distribution by providing you with the mean, variance, min/max, and so on.
The implementation uses an online algorithm to track these statistics so it uses O(1) memory. The algorithm is also numerically stable so the statistics should be accurate even for long-running programs.
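The standard technique here is Welford's online method (ekg's actual implementation may differ in details); a self-contained sketch of the idea:

```haskell
-- Welford's online algorithm: one pass over the events,
-- O(1) state, numerically stable.
data Acc = Acc
    { accCount :: !Int     -- number of events seen
    , accMean  :: !Double  -- running mean
    , accM2    :: !Double  -- sum of squared deviations from the mean
    }

-- Fold in one new event.
step :: Acc -> Double -> Acc
step (Acc n mu m2) x =
    let n'  = n + 1
        d   = x - mu
        mu' = mu + d / fromIntegral n'
    in Acc n' mu' (m2 + d * (x - mu'))

-- Mean and population variance of a stream of events.
summarize :: [Double] -> (Double, Double)
summarize xs =
    let Acc n mu m2 = foldl step (Acc 0 0 0) xs
    in (mu, if n > 0 then m2 / fromIntegral n else 0)
```

Because only the count, mean, and sum of squared deviations are kept, memory use is constant no matter how many events are added.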
While it didn't make this release, in the future you can look forward to tracking quantiles and keeping histograms of the events. This will let you track e.g. the 95th-percentile response time of your webapp.
Counters and gauges are always 64-bits
To keep ekg efficient even on 32-bit platforms, counters and gauges were previously stored as Int values. However, a counter that is increased 10,000 times per second, which isn't unusual for a busy server, would wrap around in less than 2.5 days on a 32-bit system. Therefore all counters and gauges are now stored as 64-bit values. While this is technically a breaking change, it shouldn't affect the majority of users.
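A quick sanity check of that claim (my arithmetic, not from the release notes):

```haskell
import Data.Int (Int32)

-- A signed 32-bit counter incremented 10,000 times per second wraps
-- after maxBound / rate seconds; divide by 86,400 to get days.
daysToWrap :: Double
daysToWrap = fromIntegral (maxBound :: Int32) / 10000 / 86400
-- ≈ 2.49 days
```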
I received a report of contention in ekg when multiple cores were used. This prompted me to improve the scaling of all metrics types. The difference is quite dramatic on my heavy contention benchmark:
         +RTS -N1   +RTS -N6
Before     1.998s    82.565s
After      0.117s     0.247s
The benchmark updates a single counter concurrently from 100 threads, performing 100,000 increments per thread. It was run on a 6-core machine. The cause of the contention was atomicModifyIORef, which has been replaced by an atomic-increment instruction. There are some details on the GHC Trac.
In short, you shouldn't see contention issues anymore. If you do, I still have some optimizations in reserve that I didn't apply, because the implementation should already be fast enough.
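For reference, the shape of such a contention benchmark (scaled down, and my sketch rather than the actual benchmark code), using the old atomicModifyIORef-style update that caused the contention:

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM_)
import Data.IORef (atomicModifyIORef', newIORef, readIORef)

-- N threads each bump a shared IORef counter M times: the access
-- pattern that contended badly before the atomic-increment change.
contend :: Int -> Int -> IO Int
contend nThreads nIncrs = do
    ref  <- newIORef (0 :: Int)
    done <- newEmptyMVar
    replicateM_ nThreads $ forkIO $ do
        replicateM_ nIncrs (atomicModifyIORef' ref (\n -> (n + 1, ())))
        putMVar done ()
    -- Wait for every thread to finish before reading the total.
    replicateM_ nThreads (takeMVar done)
    readIORef ref
```

Compiled with -threaded and run with +RTS -N6, timing this loop against the new counter implementation is essentially what the table above measures.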