Every year, a huge number of goods are traded internationally and declared at Customs. Accordingly, there are large incoming and outgoing message flows that need to be processed. At the Dutch Customs, this mostly takes place electronically. In today’s global economy, which runs 24 hours a day, 7 days a week, systems that work properly and continuously have become essential. The constant availability of the automated systems that handle declaration processes is therefore highly important for Customs and businesses… read the full case study
Naturally, when a particular probability distribution fits many of your data sets well, sooner or later you will want to learn more about it. What is so specific about it that makes it work for your data? And, moreover, how can you interpret the distribution parameters?
The good news is: for many probability distributions, the meaning of their parameters is described in the scientific literature. The classic example is the Normal distribution having two parameters: σ (scale) and μ (location). The parameterization of this distribution is pretty easy to understand: as you change the location parameter, the probability density graph moves along the x-axis, while changing the scale parameter affects how wide or narrow the graph is:
However, for quite a few distributions, modifying the parameters and observing how the graphs change will do little to help you understand what those parameters indicate. One such distribution is the Lognormal model, defined as “a continuous probability distribution of a random variable whose logarithm is normally distributed.” What does that mean exactly? For a better understanding, compare the CDFs of the Normal and Lognormal distributions:
Normal distribution CDF
Lognormal distribution CDF
As you can see, the Normal model is “included” into the Lognormal in such a way that in the Lognormal model, ln(x) has the Normal distribution with the same parameters (σ, μ) as the original Lognormal distribution. And this is the key point to understand: the parameters of the Lognormal model are not the “pure” scale and location (pretty intuitive in the Normal model), but rather the scale and location of the included Normal distribution.
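The relationship can be checked numerically. The sketch below assumes the standard form of the Lognormal CDF, F(x) = Φ((ln(x) − μ) / σ), where Φ is the standard Normal CDF; the function name `lognormal_cdf` and the parameter values are illustrative only and are not part of EasyFit.

```python
from math import exp, log
from statistics import NormalDist


def lognormal_cdf(x: float, mu: float, sigma: float) -> float:
    """Lognormal CDF evaluated as the Normal(mu, sigma) CDF at ln(x)."""
    if x <= 0:
        return 0.0  # a Lognormal variable takes only positive values
    return NormalDist(mu, sigma).cdf(log(x))


mu, sigma = 1.0, 0.5  # illustrative parameter values

# mu is the location of ln(x), so the Lognormal median is exp(mu), not mu.
print(lognormal_cdf(exp(mu), mu, sigma))

# The same probability read from the included Normal model at the point mu.
print(NormalDist(mu, sigma).cdf(mu))
```

Both printed probabilities refer to the same point on the log scale, which is why μ and σ describe ln(x) rather than x itself.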
The same logic applies to the Gamma and Log-Gamma pair of distributions. The classical Gamma has two parameters: α (shape), β (scale), and the Gamma CDF is as follows:
Gamma distribution CDF
The shape parameter indicates the form of the Gamma PDF graph, while the scale factor affects the spread of the curve. Similarly to the Gamma model, the Log-Gamma distribution has two parameters with the same names (α, β), but its CDF has the form:
Log-Gamma distribution CDF
Just as with the Normal and Lognormal pair, in the Log-Gamma model, ln(x) has the Gamma distribution with the same parameters (α, β). These parameters cannot be treated as the “pure” Log-Gamma shape and scale; they are the shape and scale of the included Gamma model.
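A small simulation can make this concrete. The sketch below assumes the definition used above (ln(x) follows a Gamma distribution with shape α and scale β) and draws Log-Gamma values by exponentiating Gamma samples from Python's standard library. The helper name `sample_log_gamma` and the parameter values are hypothetical.

```python
import random
from math import exp, log
from statistics import fmean, variance


def sample_log_gamma(alpha: float, beta: float, n: int, seed: int = 0) -> list[float]:
    """Draw n Log-Gamma values as exp(G), where G ~ Gamma(shape=alpha, scale=beta)."""
    rng = random.Random(seed)
    return [exp(rng.gammavariate(alpha, beta)) for _ in range(n)]


alpha, beta = 2.0, 0.5  # illustrative shape and scale
xs = sample_log_gamma(alpha, beta, 50_000)

# Taking logarithms recovers Gamma-distributed values, so their mean and
# variance should be close to alpha * beta and alpha * beta**2.
logs = [log(x) for x in xs]
print("mean of ln(x):    ", fmean(logs), "vs alpha*beta    =", alpha * beta)
print("variance of ln(x):", variance(logs), "vs alpha*beta**2 =", alpha * beta**2)
```

The moments of ln(x) match the Gamma formulas, while the moments of x itself do not, which is the sense in which α and β belong to the included Gamma model.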
There are dozens of different probability distributions out there, and even if you use only a couple of them on a daily basis, sometimes it can be hard to remember the meaning of all the parameters. That is why we decided to include a little feature in EasyFit that helps you keep your memory fresh: when moving the mouse pointer over a distribution parameter edit box, EasyFit displays a pop-up hint indicating the meaning of that particular parameter:
Distribution parameter hint
Using this feature, you can better focus on your core analysis rather than the technical details like the ones outlined in this article.
Some time ago, we covered the use of probability distributions and related Excel worksheet functions available in EasyFitXL. When dealing with probability data in Excel, most of the time, you would use those functions to set up your calculations to be performed directly within your workbooks. This approach works well for applications where you need to perform typical probability analysis based on different input data: you modify the data, and Excel recalculates the entire worksheet and updates the associated results.
However, for more advanced applications, you might need to implement complex logic requiring IF statements, which can make your worksheets too complicated. Of course, you can still use the IF worksheet function, but in practice you will want to keep your workbooks as simple as possible so that you can easily get back to your analysis a month later. That is where the built-in Visual Basic for Applications (VBA) programming language comes in handy: with a little programming knowledge, using the VBA functions available in EasyFitXL as well as in the EasyFit SDK, you can create feature-rich probability analysis and Monte Carlo simulation applications that implement logic of any degree of complexity.
Even though both EasyFitXL and the SDK include a variety of VBA functions, these software packages differ in the feature sets they offer. Initially, EasyFitXL was designed as an Excel add-in that brings the visual distribution fitting feature of EasyFit to Excel. Of course, we could not ignore the integration and data analysis automation capabilities of Excel, so we came up with the following ideology for EasyFitXL: visually fit distributions to data in Excel, and use the results in the most convenient way – either visually, in your worksheets, or in your VBA applications. That is why the VBA functions offered by EasyFitXL allow you to evaluate most common distribution functions (PDF, CDF etc.), calculate distribution statistics (mean, variance…), and generate random numbers from any probability distribution you choose as the model for your data.
On the other hand, the Simulation & Probabilistic Analysis SDK was designed from the ground up as the package targeting software developers and offering a complete range of functions covering the entire feature set of EasyFit. Apart from evaluating distribution functions, calculating statistics and generating random numbers, you can do distribution fitting, perform goodness of fit tests, and even create distribution graphs – all directly from your VBA applications.
Another major difference is that, technically, the SDK offers its functionality through a set of objects, enabling you to use the object-oriented approach to software development, which makes working with large projects more efficient. In contrast, EasyFitXL takes a function-based approach, offering a separate VBA function for each kind of distribution function and each probability distribution, which works well for short and simple programs.
Overall, depending on your needs, you can use either EasyFitXL or the SDK to implement any kind of data analysis application, ranging from simple probability calculation programs to complex automated data analysis and Monte Carlo simulation systems.
Can EasyFit be used to analyze time series data? To answer this question, which we recently received from a customer, we will shed some light on the differences between probabilistic analysis and time series analysis.
When dealing with time series data, you usually have as an input a set of (time, value) data pairs indicating the consecutive measurements taken at equally spaced time intervals. The goal of time series analysis is to identify the nature of the process represented by your data, and use it to forecast the future values of the time series being analyzed.
A widespread application of such an analysis is weather forecasting: for more than a century, hundreds of weather stations around the world have recorded various important parameters such as air temperature, wind speed, precipitation, and snowfall. Based on these data, scientists build models reflecting seasonal weather changes (depending on the time of the year) as well as global trends – for example, temperature change during the last 50 years. These models are used to provide weather forecasts for government and commercial organizations. In a typical forecast, the predicted values are not assigned a probability: “In May, the maximum daily air temperature is expected to be 22 degrees Celsius.”
In contrast to predictions based on time series analysis, probabilistic analysis gives you not just a single value as a forecast, but a probabilistic model that accounts for uncertainty. In this scenario, you obtain a continuous range of values with assigned probabilities. Of course, for real-world applications, it is more practical to deal with specific values, so probabilistic models are used to obtain predictions at fixed probability levels. Considering the above example, a forecast might look like: “In May, the maximum daily air temperature will not exceed 22 degrees Celsius with 95% probability.”
So can distribution fitting be useful when analyzing time series data? The answer depends on the goals of your analysis – i.e. what kind of information you want to derive from your data. If you want to understand the connection between the predicted values and the probability, you should fit distributions to your data (just keep in mind that in this case the “time” variable will be unused). On the other hand, if you need to identify seasonal patterns or global trends in your data, you should go with the “classical” time series analysis methods.
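As a minimal sketch of a prediction at a fixed probability level, the example below fits a distribution to a set of values and reads off the 95% quantile. The temperature readings are made up for illustration, and the Normal model is an assumed choice; in practice you would pick whichever distribution fits your data best.

```python
from statistics import NormalDist

# Hypothetical maximum daily air temperatures for May, in degrees Celsius.
# Only the values are used; the order (the "time" variable) plays no role.
may_max_temps = [17.2, 18.5, 19.1, 20.4, 16.8, 21.0, 19.7, 18.9,
                 22.3, 20.1, 17.9, 19.4, 20.8, 18.2, 21.5]

# Fit a Normal model from the sample mean and standard deviation.
model = NormalDist.from_samples(may_max_temps)

# Temperature that, under the fitted model, is not exceeded with 95% probability.
t95 = model.inv_cdf(0.95)
print(f"With 95% probability, the maximum temperature will not exceed {t95:.1f} °C")
```

Shuffling `may_max_temps` leaves the result unchanged, which is exactly why this kind of analysis cannot reveal seasonal patterns or trends.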
From time to time, we receive emails from our customers asking what parameter estimation methods are implemented in EasyFit to carry out distribution fitting. When designing EasyFit, we were striving for a good balance between the accuracy and speed of calculations. That is why we decided to use the Method of Moments (MOM) for those models that allow for easy use of this method. Some examples of such distributions include the Chi-Squared, Exponential, two-parameter Gamma, and Logistic models. However, for many other distributions, the Method of Moments does not yield closed form expressions for parameter estimates, and in such cases EasyFit uses the Maximum Likelihood Estimation (MLE) method. In addition, for some distributions used in specific industries, such as the Wakeby model, EasyFit employs the Method of L-Moments (LMOM). You can find a detailed list of supported distributions and estimation methods used on our website.
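To show why the Method of Moments is fast for models such as the two-parameter Gamma, here is a minimal sketch of the textbook closed-form estimates: equating the sample mean and variance to αβ and αβ² gives α = mean² / variance and β = variance / mean. This illustrates the general technique only and is not EasyFit's actual implementation; the function name `gamma_mom` and the test data are hypothetical.

```python
import random
from statistics import fmean, variance


def gamma_mom(data: list[float]) -> tuple[float, float]:
    """Method of Moments estimates (alpha, beta) for a two-parameter Gamma model."""
    m = fmean(data)
    v = variance(data)  # sample variance
    alpha = m * m / v   # shape: mean^2 / variance
    beta = v / m        # scale: variance / mean
    return alpha, beta


# Synthetic data drawn from Gamma(shape=3, scale=2) to check the estimates.
rng = random.Random(42)
sample = [rng.gammavariate(3.0, 2.0) for _ in range(20_000)]

alpha_hat, beta_hat = gamma_mom(sample)
print(f"alpha ≈ {alpha_hat:.2f}, beta ≈ {beta_hat:.2f}")
```

No iterative optimization is involved, in contrast to Maximum Likelihood Estimation, which for many distributions requires solving the likelihood equations numerically.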
Recently we have released a new version of our SDK. In this update, we have added a new property that lets you obtain the current licensing status of the SDK – for instance, you can determine whether the SDK is currently running in trial mode (using the Evaluation License), and if so, how many days are left until the evaluation period expires.
Consider the following scenario: you are building an application with a modular structure that, apart from its core feature set, provides some additional functionality through a number of modules, or add-ins, which can be installed and enabled on an optional basis. Now, suppose one of these modules uses the simulation or distribution fitting features of the SDK, and you want to give your users the ability to evaluate it before making a purchase decision. The new version of the SDK lets you easily integrate this logic into your applications, allowing you to create more flexible solutions that better meet your customers’ needs.
Because risk and uncertainty are a part of literally all areas of our life, with finance being one of the most important, scientifically based risk management methods are gaining more and more popularity among finance industry professionals. Currency fluctuations affect all businesses dealing with multiple currencies, so having at least some degree of certainty about future exchange rates can be a significant success factor for any international enterprise. A wide range of currency forecasting methods has been developed. However, not many of them can claim to be reliable in the long run: most algorithms only work for a short period of time and need to be tweaked as market conditions change.
Brijen Hathi, a Research Fellow at the Planetary & Space Sciences Research Institute, performs his own research in the field and publishes the results in the Currency Forecasting Blog. The forecasting methodology employed by Mr. Hathi is in part based on the same techniques used in probabilistic risk analysis. As with most modern forecasting methods, this approach uses historical data to predict the future, but the big difference is that he also assigns specific probabilities to the predictions. For example, for a US-based company doing business in the UK, it doesn’t really matter what the exact GBP/USD exchange rate is going to be during the next 30 days, as long as it stays within a specific interval with a high probability (95% or more). Recently, Mr. Hathi published an article highlighting the use of EasyFit to model the pricing probability of the Pound Sterling versus the US Dollar from historical data. It is fascinating to see EasyFit being used in what we believe is a truly scientific approach to data analysis, and we hope to see new developments in this area soon.
The software development community struggles to find a way to tell whether projects are on schedule, given that constant invention inevitably carries uncertainty and risk. Current practice is for developers to estimate a software project up front, attempting to consider all variations in order to get a viable estimate of time and cost. This process is laborious, and even with due rigor, projects slip when actual times fail to match the estimates. This leads to costly project overruns and a lack of trust in future estimates.
As part of the Agile movement for software development, we think there is a better way, and we are championing the use of Monte Carlo simulation as a way of assessing likely progress and dealing with delays as early as possible… read the full case study
What is Cloud Computing?
For some time now, there has been a lot of buzz around cloud computing – the relatively new computing paradigm in which resources, software, and information are shared across computer clusters and delivered to users on demand through the Internet. The idea behind cluster computing is not new: if your applications require a lot of computing resources or impose very strict reliability requirements that cannot be met by a single personal computer or server, you can link a group of computers into a cluster that will provide much better performance.
Why Not Build a Cluster Yourself?
Building and maintaining a computer cluster in your organization may have some downsides, such as large upfront investments in technology infrastructure and high running costs. Of course, there are companies that will build and manage a computer cluster for you, but the bottom line is the same: depending on how loaded your cluster is going to be, it may or may not be economically feasible for your company to run it on-site. For instance, if you need to quickly perform a very CPU-intensive calculation (e.g. render a complex 3D scene), but only once a day, chances are the cluster will not pay off.
And here’s where cloud computing comes into play: you can have access to great computing resources, pay only as you use them, and not worry about the underlying technology infrastructure. These factors combined can provide a great economic benefit, and some major Internet players, including Amazon and Google, are already offering cloud computing platforms for those who want to make their businesses more efficient.
Is Cloud Computing a Good Fit for Risk Analysis?
As one might guess, not just any kind of application can be efficiently run on the cloud. Because at the core of a cloud is a number of computers linked into a cluster, it is very good at processing a large number of independent tasks, such as requests to a web server. That might be the reason why the cloud computing platforms offered by Amazon and Google are mostly used to run websites.
If you consider risk analysis, it looks like an ideal application to be run on the cloud: an input model of several megabytes that can be easily sent to the cloud, a need for huge computational resources to quickly perform Monte Carlo simulation and distribution fitting, sometimes a need for a lot of storage to hold intermediate results, and relatively small-sized analysis results that can be sent back to the users as text and graphics. Add to that the ever-increasing complexity of risk models used across various industries, causing analysts to wait for hours while their simulations are running, and you have a potentially good opportunity to make risk analysis more economically efficient.
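The reason Monte Carlo simulation suits this kind of platform is that the trials are independent, so a run can be split into batches that are processed separately and then combined. The sketch below illustrates the idea with a made-up risk model (the total of three lognormally distributed losses) and a local thread pool standing in for cloud workers; the function names, the model, and the threshold are all hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_batch(seed: int, n_trials: int, threshold: float) -> int:
    """Run one independent batch; return how many trials exceed the threshold."""
    rng = random.Random(seed)  # each batch has its own seeded generator
    exceed = 0
    for _ in range(n_trials):
        # Hypothetical model: total loss is the sum of three lognormal losses.
        total = sum(rng.lognormvariate(0.0, 0.5) for _ in range(3))
        if total > threshold:
            exceed += 1
    return exceed


def estimate_exceedance(n_batches: int, n_trials: int, threshold: float) -> float:
    """Estimate P(total loss > threshold) by combining independent batches."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        counts = list(pool.map(lambda s: run_batch(s, n_trials, threshold),
                               range(n_batches)))
    return sum(counts) / (n_batches * n_trials)


p = estimate_exceedance(n_batches=8, n_trials=5_000, threshold=5.0)
print(f"Estimated probability of exceeding the threshold: {p:.4f}")
```

Each batch needs only a seed and the model as input and returns a single count, which mirrors the small-input, heavy-computation, small-output profile described above. On an actual cluster, every `run_batch` call could be dispatched to a different machine.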
To perform further research in this field, we have partnered with Supportex, a technology services company based in the Czech Republic. Supportex has solid experience providing a cloud computing platform for solving problems much more complex than processing requests to a web server. That is why we have decided to rely on their hardware infrastructure and domain knowledge to run some test applications and see if cloud computing can be of real help in the field of risk analysis. Once we run the tests, we plan to publish the results in this blog, so stay tuned!
Over the last five years, we have been adding new features to EasyFit mostly with business users in mind, but thanks to the nature of the product and the special academic pricing, it has become quite popular in the academic community. A quick Google search reveals numerous research papers referring to EasyFit. To name just a few:
- “Co-evolution of Social and Affiliation Networks” (University of Maryland, USA)
- “Power laws in top wealth distributions: evidence from Canada” (Brock University, Canada)
- “Duration of Coherence Intervals in Electrical Brain Activity in Perceptual Organization” (RIKEN Brain Science Institute, Japan)
- “Resource Management Schemes for Mobile Ad hoc Networks” (National University of Singapore, Singapore)
- “Modelling the diffusion of innovation management theory using S-curves” (University of London, UK)
It is pleasing to see EasyFit helping researchers in such diverse disciplines get their job done in a more efficient way.