ℹ️ Skipped - page is already crawled
| Filter | Status | Condition | Details |
|---|---|---|---|
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.1 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |
| Property | Value |
|---|---|
| URL | https://www.quantstart.com/articles/calculating-realised-volatility-with-polygon-forex-data/ |
| Last Crawled | 2026-04-14 08:33:41 (1 day ago) |
| First Indexed | 2023-06-28 16:02:16 (2 years ago) |
| HTTP Status Code | 200 |
| Meta Title | Calculating Realised Volatility with Polygon Forex data | QuantStart |
| Meta Description | In this article we carry out an exploratory data analysis of minutely Forex pairs data |
| Meta Canonical | null |
| Boilerpipe Text | In the previous article we wrote a Python function which utilised the Polygon API to extract a month of minutely data for both a major (EURUSD) and an exotic (MXNZAR) FX pair. We plotted the returns series and looked at some of the issues that can occur when working with this type of data. This article is part of a series in which we will be creating a machine learning model that uses realised volatility to predict market regime change. In this article we will learn how to calculate realised volatility. We will also conduct an exploratory data analysis to prepare the feature set for our machine learning model.
In order to follow along with the code in this tutorial you will need:
Python 3.9
Matplotlib 3.5
Pandas 1.4
Requests 2.27
Seaborn 0.12.
We also recommend that you take a look at our early career research series to create an algorithmic trading environment inside a Jupyter Notebook.
In order to calculate realised volatility we first need to obtain and format the data. In the previous article we created Python functions to contact the Polygon API and obtain a month of minutely data for EURUSD and MXNZAR. You can view the full article here. Below you will find the code to obtain the data.
import json
import os  # used to read the API key from the environment

import matplotlib.pyplot as plt
import pandas as pd
import requests
import seaborn as sns

POLYGON_API_KEY = os.getenv("POLYGON_API_KEY")
HEADERS = {'Authorization': 'Bearer ' + POLYGON_API_KEY}
BASE_URL = 'https://api.polygon.io/v2/aggs/'


def get_fx_pairs_data(pg_tickers):
    # Request one month of one-minute aggregates for each ticker
    start = '2023-01-01'
    end = '2023-01-31'
    multiplier = '1'
    timespan = 'minute'
    fx_url = f"range/{multiplier}/{timespan}/{start}/{end}?adjusted=true&sort=asc&limit=50000"
    fx_pairs_dict = {}
    for pair in pg_tickers:
        response = requests.get(
            f"{BASE_URL}ticker/{pair}/{fx_url}",
            headers=HEADERS
        ).json()
        fx_pairs_dict[pair] = pd.DataFrame(response['results'])
    return fx_pairs_dict


def format_fx_pairs(fx_pairs_dict):
    # Convert the millisecond timestamps to datetimes and use them as the index
    for pair in fx_pairs_dict.keys():
        fx_pairs_dict[pair]['t'] = pd.to_datetime(fx_pairs_dict[pair]['t'], unit='ms')
        fx_pairs_dict[pair] = fx_pairs_dict[pair].set_index('t')
    return fx_pairs_dict


def create_str_index(fx_pairs_dict):
    # Replace the datetime index with a formatted string index
    fx_pairs_str_ind = {}
    for pair in fx_pairs_dict.keys():
        fx_pairs_str_ind[pair] = fx_pairs_dict[pair].set_index(
            fx_pairs_dict[pair].index.to_series().dt.strftime('%Y-%m-%d-%H-%M-%S')
        )
    return fx_pairs_str_ind


def create_returns_series(fx_pairs_str_ind):
    # Percentage change of the close price gives the returns series
    for pair in fx_pairs_str_ind.keys():
        fx_pairs_str_ind[pair]['rets'] = fx_pairs_str_ind[pair]['c'].pct_change()
    return fx_pairs_str_ind
Python
The functions can be called in a new cell, or in a main function if you are running the code outside a notebook environment. The code below stores the return value of each function in a variable.
pg_tickers = ["C:EURUSD", "C:MXNZAR"]
fx_pairs_dict = get_fx_pairs_data(pg_tickers)
formatted_fx_dict = format_fx_pairs(fx_pairs_dict)
fx_pairs_str_ind = create_str_index(formatted_fx_dict)
fx_returns_dict = create_returns_series(fx_pairs_str_ind)
Python
The FX data for each pair can be accessed as below:
fx_returns_dict['C:EURUSD']
Python
EURUSD formatted returns series
Realised volatility is a measure of risk. It measures the variability in the returns of an investment over a defined period of time. It is calculated as the square root of the sum of the squared returns over that period. In contrast to implied volatility, realised volatility shows the actual change in historical prices, rather than a prediction of future volatility. However, it is possible to use the data to forecast the volatility in returns.
So why is this useful? If you can reliably predict volatility for a window of future time you can make an assessment of the market regime. This can help you determine the types of trading strategies you may want to use. For example, in periods of low volatility you might look at trend following strategies on stocks. In a high volatility environment you might want to consider mean reverting strategies or options trading strategies such as a Straddle or Strangle. By quantifying the level of risk present in your trading universe you can make better trading decisions.
Realised volatility is a way of understanding the degree of price movement for a given period. It is calculated as follows:
Collect the price data and calculate the returns
Square the returns to give more weight to larger changes
Calculate the average of the squared returns by adding them up and dividing by the number of periods (this is also known as the variance)
Take the square root of the variance (also known as the standard deviation). This is your measure of realised volatility
Ultimately we will be training a Support Vector Regressor to help us identify market regime change. In order to do this we will calculate realised volatility across a particular window of time. The window you select will depend upon the frequency of your data and the liquidity of the asset. As we are considering both a major and an exotic FX pair we will calculate our realised volatility over 30 data points. The Pandas library contains a rolling() method that can be combined with any statistical aggregator using method chaining. Here we use rolling() with std() to calculate the rolling standard deviation. The rolling window size can be set to a fixed number of observations or to a time offset, such as 30 minutes.
Calculating Realised Volatility
The code below will create a new column in each of the fx pairs DataFrames which will contain the realised vol calculated as the rolling standard deviation of the last 30 data points. This will equate approximately to the last 30 minutes of data, with some variation depending on the trading frequency of the FX pair. Let's take a look at the realised volatility on a chart.
def create_realised_vol(pairs_rets_dict):
    # Rolling standard deviation of returns over the last 30 observations
    for pair in pairs_rets_dict:
        pairs_rets_dict[pair]['realised_vol'] = pairs_rets_dict[pair]['rets'].rolling(30).std()
    return pairs_rets_dict
Python
Realised volatility for EURUSD and MXNZAR
As you can see, the exotic pair MXNZAR has fewer data points than EURUSD. As it is thinly traded, this is to be expected. In fact, if we look at the length of both our DataFrames we can see that EURUSD has 31745 rows, whereas MXNZAR has 16865 rows, almost half as many. In the previous article we discussed the different options available to us for handling the missing data points in the MXNZAR pair.
Exploring the Features of the Data
Before we begin to construct a machine learning model it is a good idea to have a look at the data and understand how its features could affect model performance. Why should you do this? Why can't you just get going? Well, firstly the distribution of the data should influence your choice of model. For example, if the relationship in your data is linear, a linear regression model or a linear kernel in an SVM would be appropriate. If the relationship is non-linear, a polynomial regression or a radial basis function kernel would be more appropriate. You might also want to consider transforming or scaling your data to make it more compatible with your chosen model. Perhaps your data contains outliers or missing data that you need to be aware of. There could be correlation (or linear association) between the features you have selected to train your model. There are many reasons why it is important to get to know your data before you start training a model. In fact it's estimated that between 70-80% of an engineer's time is spent toiling with data. Let's take a look at a simple exploratory data analysis:
First we will look at the distribution of the returns for both our FX pairs. We use the Seaborn function kdeplot() to visualise the distributions.
fig, ax = plt.subplots(2, 2, figsize=(16, 10), squeeze=False)
y_data = ['rets', 'realised_vol']
for idx, fxpair in enumerate(fx_pairs_str_ind.keys()):
    for idx2, dfcol in enumerate(y_data):
        # One row per FX pair, one column per plotted series
        row = (idx)
        col = (idx2 % 2)
        sns.kdeplot(fx_pairs_str_ind[fxpair][dfcol], ax=ax[row][col], bw_adjust=0.5)
        ax[row][col].set_title(fxpair)
fig.suptitle("KDE of Returns & Realised Volatility")
plt.show()
Python
Returns and realised Volatility distributions
Firstly it's important to note that the scale on the x axis for EURUSD and MXNZAR is different. The range of returns is larger for MXNZAR in both the positive and negative direction when compared to EURUSD. This is reflected in the realised volatility. We can also see that all of the distributions have positive excess kurtosis when compared to a standard normal distribution. The realised volatility also has positive skew. This information will be helpful when we are trying to choose or improve any machine learning models that we might want to use. Finally, in the exotic FX pair MXNZAR, we can see two small humps just before -0.01 and 0.01. This indicates the presence of spikes in our returns data.
To look into these in more detail we can use a violin plot. This will allow us to see how many instances of these spikes there are in our data and how far from the bulk of the distribution they are. The following code will create the violin plot.
fig, ax = plt.subplots(1, 2, figsize=(16, 10), squeeze=False)
for idx, fxpair in enumerate(fx_pairs_str_ind.keys()):
    row = (idx // 2)
    col = (idx % 2)
    sns.violinplot(x='rets', data=fx_pairs_str_ind[fxpair], ax=ax[row][col], inner='point')
    ax[row][col].set_title(fxpair)
fig.suptitle("Violin plot of rets")
plt.show()
Python
Violin plot of returns
As you can see there are more points further from the bulk of the distribution in the MXNZAR FX pair. This means that there are more occasions where there is a higher or lower return when compared to EURUSD. This information alone could be useful when you are considering trading strategies.
As we are planning to put our data into a machine learning model we also need to think about correlation. We want to use the realised volatility and the returns as features in a training set for a Support Vector Regressor. We need to ensure that these two features aren't correlated, otherwise we would be giving the machine learning model the same information twice. The model would become overspecified which could lead to poor model performance. Below we will look at the correlation between our two chosen features. In order to use both features we would be looking for a correlation close to zero.
The following code displays a scatter plot of the two columns: returns and realised vol. It also uses the Pandas corr method to calculate the correlation between the two.
fig, ax = plt.subplots(1, 2, figsize=(16, 10), squeeze=False)
for idx, fxpair in enumerate(fx_pairs_str_ind.keys()):
    row = (idx // 2)
    col = (idx % 2)
    print(
        f"vol and rets correlation for {fxpair} is "
        f"{fx_pairs_str_ind[fxpair]['rets'].corr(fx_pairs_str_ind[fxpair]['realised_vol'])}"
    )
    sns.scatterplot(data=fx_pairs_str_ind[fxpair], x='rets', y='realised_vol', ax=ax[row][col])
    ax[row][col].set_title(fxpair)
fig.suptitle("Correlation between realised vol and returns")
plt.show()
Python
vol and rets correlation for C:EURUSD is 0.008572524273933109
vol and rets correlation for C:MXNZAR is 0.018750753882674935
None
FX pairs returns and realised vol correlation
Now that we know our returns series and realised vol have little correlation we can use them as separate features in our machine learning model. In the next article we will be building a Support Vector Regressor to determine how reliably we can predict the next value of realised volatility using the returns and realised volatility values from the previous time point. This will allow us to prepare a pipeline for our data, understand some of the model parameters and prepare for the final specification of the model.
Related Articles
Creating a Returns Series with Polygon's Forex Data |
| Markdown | - [QuantStart](https://www.quantstart.com/)
- [Quantcademy](https://www.quantstart.com/quantcademy/)
- [Books](https://www.quantstart.com/articles/calculating-realised-volatility-with-polygon-forex-data/)
[Successful Algorithmic Trading](https://www.quantstart.com/successful-algorithmic-trading-ebook/)
[Advanced Algorithmic Trading](https://www.quantstart.com/advanced-algorithmic-trading-ebook/)
[C++ For Quantitative Finance](https://www.quantstart.com/cpp-for-quantitative-finance-ebook/)
- [QSTrader](https://www.quantstart.com/qstrader/)
- [Articles](https://www.quantstart.com/articles/)
- [Login](https://www.quantstart.com/members/login/)
Calculating Realised Volatility with Polygon Forex data
In this article we carry out an exploratory data analysis of minutely Forex pairs data
In the previous [article](https://www.quantstart.com/articles/creating-a-returns-series-with-polygons-forex-data/) we wrote a Python function which utilised the Polygon API to extract a month of minutely data for both a major (EURUSD) and an exotic (MXNZAR) FX pair. We plotted the returns series and looked at some of the issues that can occur when working with this type of data. This article is part of a series in which we will be creating a machine learning model that uses realised volatility to predict market regime change. In this article we will learn how to calculate realised volatility. We will also conduct an exploratory data analysis to prepare the feature set for our machine learning model.
In order to follow along with the code in this tutorial you will need:
- Python 3.9
- Matplotlib 3.5
- Pandas 1.4
- Requests 2.27
- Seaborn 0.12.
We also recommend that you take a look at our [early career research series](https://www.quantstart.com/articles/creating-an-algorithmic-trading-prototyping-environment-with-jupyter-notebooks-and-plotly/) to create an algorithmic trading environment inside a Jupyter Notebook.
## Realised Volatility
In order to calculate realised volatility we first need to obtain and format the data. In the previous article we created Python functions to contact the Polygon API and obtain a month of minutely data for EURUSD and MXNZAR. You can view the full article [here](https://www.quantstart.com/articles/creating-a-returns-series-with-polygons-forex-data/). Below you will find the code to obtain the data.
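```python
import json
import os  # used to read the API key from the environment

import matplotlib.pyplot as plt
import pandas as pd
import requests
import seaborn as sns

POLYGON_API_KEY = os.getenv("POLYGON_API_KEY")
HEADERS = {'Authorization': 'Bearer ' + POLYGON_API_KEY}
BASE_URL = 'https://api.polygon.io/v2/aggs/'


def get_fx_pairs_data(pg_tickers):
    # Request one month of one-minute aggregates for each ticker
    start = '2023-01-01'
    end = '2023-01-31'
    multiplier = '1'
    timespan = 'minute'
    fx_url = f"range/{multiplier}/{timespan}/{start}/{end}?adjusted=true&sort=asc&limit=50000"
    fx_pairs_dict = {}
    for pair in pg_tickers:
        response = requests.get(
            f"{BASE_URL}ticker/{pair}/{fx_url}",
            headers=HEADERS
        ).json()
        fx_pairs_dict[pair] = pd.DataFrame(response['results'])
    return fx_pairs_dict


def format_fx_pairs(fx_pairs_dict):
    # Convert the millisecond timestamps to datetimes and use them as the index
    for pair in fx_pairs_dict.keys():
        fx_pairs_dict[pair]['t'] = pd.to_datetime(fx_pairs_dict[pair]['t'], unit='ms')
        fx_pairs_dict[pair] = fx_pairs_dict[pair].set_index('t')
    return fx_pairs_dict


def create_str_index(fx_pairs_dict):
    # Replace the datetime index with a formatted string index
    fx_pairs_str_ind = {}
    for pair in fx_pairs_dict.keys():
        fx_pairs_str_ind[pair] = fx_pairs_dict[pair].set_index(
            fx_pairs_dict[pair].index.to_series().dt.strftime('%Y-%m-%d-%H-%M-%S')
        )
    return fx_pairs_str_ind


def create_returns_series(fx_pairs_str_ind):
    # Percentage change of the close price gives the returns series
    for pair in fx_pairs_str_ind.keys():
        fx_pairs_str_ind[pair]['rets'] = fx_pairs_str_ind[pair]['c'].pct_change()
    return fx_pairs_str_ind
```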
The functions can be called in a new cell, or in a main function if you are running the code outside a notebook environment. The code below stores the return value of each function in a variable.
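```python
pg_tickers = ["C:EURUSD", "C:MXNZAR"]
fx_pairs_dict = get_fx_pairs_data(pg_tickers)
formatted_fx_dict = format_fx_pairs(fx_pairs_dict)
fx_pairs_str_ind = create_str_index(formatted_fx_dict)
fx_returns_dict = create_returns_series(fx_pairs_str_ind)
```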
The FX data for each pair can be accessed as below:
```python
fx_returns_dict['C:EURUSD']
```

EURUSD formatted returns series
Realised volatility is a measure of risk. It measures the variability in the returns of an investment over a defined period of time. It is calculated as the square root of the sum of the squared returns over that period. In contrast to [implied volatility](https://www.quantstart.com/articles/Implied-Volatility-in-C-using-Template-Functions-and-Interval-Bisection/), realised volatility shows the actual change in historical prices, rather than a prediction of future volatility. However, it is possible to use the data to forecast the volatility in returns.
So why is this useful? If you can reliably predict volatility for a window of future time you can make an assessment of the market regime. This can help you determine the types of trading strategies you may want to use. For example in periods of low volatility you might look at trend following strategies on stocks. In a high volatility environment you might want to consider mean reverting strategies or options trading strategies such as [Straddle or Strangle](https://www.investopedia.com/articles/active-trading/040515/how-profit-volatility.asp). By quantifying the level of risk present in your trading universe you can make better trading decisions.
Realised volatility is a way of understanding the degree of price movement for a given period. It is calculated as follows:
1. Collect the price data and calculate the returns
2. Square the returns to give more weight to larger changes
3. Calculate the average of the squared returns by adding them up and dividing by the number of periods (this is also known as the variance)
4. Take the square root of the variance (also known as the standard deviation). This is your measure of realised volatility
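As a minimal sketch of the four steps above, the snippet below computes the figure by hand for a short list of prices. The prices are made-up illustrative values, and the helper names `simple_returns` and `realised_vol` are hypothetical rather than taken from the article's code. Note that the steps as written do not subtract the mean return, so the result is a root mean square. The sample standard deviation (which is what the Pandas `std()` method computes by default) subtracts the mean and divides by n - 1, so the two agree only approximately, and most closely when the mean return is near zero.

```python
import math
import statistics


def simple_returns(prices):
    """Step 1: percentage change between consecutive prices."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


def realised_vol(rets):
    """Steps 2-4: square the returns, average them, take the square root."""
    squared = [r ** 2 for r in rets]          # step 2
    mean_squared = sum(squared) / len(rets)   # step 3
    return math.sqrt(mean_squared)            # step 4


# Hypothetical closing prices, for illustration only
prices = [1.0700, 1.0702, 1.0699, 1.0705, 1.0701, 1.0703]
rets = simple_returns(prices)

rv = realised_vol(rets)
sample_std = statistics.stdev(rets)  # mean-adjusted, divides by n - 1

print(f"root mean square of returns: {rv:.6f}")
print(f"sample standard deviation:   {sample_std:.6f}")
```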
Ultimately we will be training a Support Vector Regressor to help us identify market regime change. In order to do this we will calculate realised volatility across a particular window of time. The window you select will depend upon the frequency of your data and the liquidity of the asset. As we are considering both a major and an exotic FX pair we will calculate our realised volatility over 30 data points. The Pandas library contains a `rolling()` method that can be combined with any statistical aggregator using [method chaining](https://practicaldatascience.co.uk/data-science/how-to-use-method-chaining-in-pandas). Here we use `rolling()` with `std()` to calculate the rolling standard deviation. The rolling window size can be set to a fixed number of observations or to a time offset, such as 30 minutes.
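To illustrate the difference between a count-based and a time-based window, here is a small sketch on synthetic data. The series, the random seed and the 30-minute gap are assumptions made up for this example. A time-based window needs a `DatetimeIndex`, so on the real data it would have to be applied before the index is converted to strings by `create_str_index`.

```python
import numpy as np
import pandas as pd

# Synthetic minutely returns (illustrative values, not Polygon data)
index = pd.date_range("2023-01-02 00:00", periods=120, freq="min")
rng = np.random.default_rng(0)
rets = pd.Series(rng.normal(0.0, 1e-4, size=len(index)), index=index)

# Simulate a thinly traded pair by dropping 30 consecutive minutes
rets = rets.drop(rets.index[40:70])

# Count-based window: always the last 30 observations, however old they are
vol_by_count = rets.rolling(30).std()

# Time-based window: only observations from the last 30 minutes of clock time
vol_by_time = rets.rolling("30min").std()
obs_in_window = rets.rolling("30min").count()

first_after_gap = rets.index[40]
print("observations in the 30-minute window:", obs_in_window.loc[first_after_gap])
print("count-based vol:", vol_by_count.loc[first_after_gap])
print("time-based vol: ", vol_by_time.loc[first_after_gap])
```

Immediately after the gap the count-based window still reaches back across the missing minutes, while the time-based window holds a single observation and so cannot produce a standard deviation.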
## Calculating Realised Volatility
The code below will create a new column in each of the fx pairs DataFrames which will contain the realised vol calculated as the rolling standard deviation of the last 30 data points. This will equate approximately to the last 30 minutes of data, with some variation depending on the trading frequency of the FX pair. Let's take a look at the realised volatility on a chart.
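```python
def create_realised_vol(pairs_rets_dict):
    # Rolling standard deviation of returns over the last 30 observations
    for pair in pairs_rets_dict:
        pairs_rets_dict[pair]['realised_vol'] = pairs_rets_dict[pair]['rets'].rolling(30).std()
    return pairs_rets_dict
```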

Realised volatility for EURUSD and MXNZAR
As you can see, the exotic pair MXNZAR has fewer data points than EURUSD. As it is thinly traded, this is to be expected. In fact, if we look at the length of both our DataFrames we can see that EURUSD has 31745 rows, whereas MXNZAR has 16865 rows, almost half as many. In the previous article we discussed the different options available to us for handling the missing data points in the MXNZAR pair.
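One rough way to quantify the missing data points is to compare each DataFrame's timestamps with a complete one-minute grid. The sketch below assumes a DataFrame with a `DatetimeIndex` (as returned by `format_fx_pairs`, before the index is converted to strings). The helper name `count_missing_minutes` and the toy data are hypothetical. Bear in mind that a full calendar grid also counts periods when the FX market is closed, such as weekends, so the figure is an upper bound on the bars lost to thin trading.

```python
import pandas as pd


def count_missing_minutes(df):
    """Count one-minute timestamps between the first and last bar that have no row."""
    full_grid = pd.date_range(df.index.min(), df.index.max(), freq="min")
    return len(full_grid.difference(df.index))


# Toy example: ten minutes of closes with three bars removed
toy_index = pd.date_range("2023-01-02 09:00", periods=10, freq="min")
toy = pd.DataFrame({"c": [18.10 + 0.01 * i for i in range(10)]}, index=toy_index)
toy = toy.drop(toy_index[[3, 4, 7]])

missing = count_missing_minutes(toy)
print(f"{missing} of {len(toy) + missing} minute bars are missing")
```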
## Exploring the Features of the Data
Before we begin to construct a machine learning model it is a good idea to have a look at the data and understand how its features could affect model performance. Why should you do this? Why can't you just get going? Well, firstly the distribution of the data should influence your choice of model. For example, if the relationship in your data is linear, a linear regression model or a linear kernel in an SVM would be appropriate. If the relationship is non-linear, a polynomial regression or a radial basis function kernel would be more appropriate. You might also want to consider [transforming or scaling](https://scikit-learn.org/stable/modules/preprocessing.html) your data to make it more compatible with your chosen model. Perhaps your data contains outliers or missing data that you need to be aware of. There could be correlation (or linear association) between the features you have selected to train your model. There are many reasons why it is important to get to know your data before you start training a model. In fact it's estimated that between 70-80% of an engineer's time is spent toiling with data. Let's take a look at a simple exploratory data analysis:
First we will look at the distribution of the returns for both our FX pairs. We use the Seaborn function [kdeplot()](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) to visualise the distributions.
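```python
fig, ax = plt.subplots(2, 2, figsize=(16, 10), squeeze=False)
y_data = ['rets', 'realised_vol']
for idx, fxpair in enumerate(fx_pairs_str_ind.keys()):
    for idx2, dfcol in enumerate(y_data):
        # One row per FX pair, one column per plotted series
        row = (idx)
        col = (idx2 % 2)
        sns.kdeplot(fx_pairs_str_ind[fxpair][dfcol], ax=ax[row][col], bw_adjust=0.5)
        ax[row][col].set_title(fxpair)
fig.suptitle("KDE of Returns & Realised Volatility")
plt.show()
```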

Returns and realised Volatility distributions
Firstly it's important to note that the scale on the x axis for EURUSD and MXNZAR is different. The range of returns is larger for MXNZAR in both the positive and negative direction when compared to EURUSD. This is reflected in the realised volatility. We can also see that all of the distributions have positive excess kurtosis when compared to a standard normal distribution. The realised volatility also has positive skew. This information will be helpful when we are trying to choose or improve any machine learning models that we might want to use. Finally, in the exotic FX pair MXNZAR, we can see two small humps just before -0.01 and 0.01. This indicates the presence of spikes in our returns data.
To look into these in more detail we can use a [violin plot](https://seaborn.pydata.org/generated/seaborn.violinplot.html). This will allow us to see how many instances of these spikes there are in our data and how far from the bulk of the distribution they are. The following code will create the violin plot.
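```python
fig, ax = plt.subplots(1, 2, figsize=(16, 10), squeeze=False)
for idx, fxpair in enumerate(fx_pairs_str_ind.keys()):
    row = (idx // 2)
    col = (idx % 2)
    sns.violinplot(x='rets', data=fx_pairs_str_ind[fxpair], ax=ax[row][col], inner='point')
    ax[row][col].set_title(fxpair)
fig.suptitle("Violin plot of rets")
plt.show()
```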

Violin plot of returns
As you can see there are more points further from the bulk of the distribution in the MXNZAR FX pair. This means that there are more occasions where there is a higher or lower return when compared to EURUSD. This information alone could be useful when you are considering trading strategies.
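The skew and kurtosis read off the KDE plots earlier can also be put into numbers. As a sketch, the snippet below uses the Pandas `skew()` and `kurt()` methods on synthetic series. `kurt()` reports excess kurtosis, so a normal distribution scores close to zero. The three distributions and their parameters are assumptions chosen purely to illustrate the idea and are not fitted to the FX data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 20_000

samples = pd.DataFrame({
    # Baseline: normally distributed "returns"
    "normal": rng.normal(0.0, 1e-4, size=n),
    # Fat-tailed "returns": Student's t with 5 degrees of freedom
    "fat_tailed": rng.standard_t(df=5, size=n) * 1e-4,
    # Right-skewed, strictly positive values, loosely resembling a volatility series
    "right_skewed": rng.lognormal(mean=-9.0, sigma=0.5, size=n),
})

summary = pd.DataFrame({
    "skew": samples.skew(),
    "excess_kurtosis": samples.kurt(),
})
print(summary)
```

On the real data the same two calls could be applied to the `rets` and `realised_vol` columns of each DataFrame.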
As we are planning to put our data into a machine learning model we also need to think about correlation. We want to use the realised volatility and the returns as features in a training set for a Support Vector Regressor. We need to ensure that these two features aren't correlated, otherwise we would be giving the machine learning model the same information twice. The model would become overspecified which could lead to poor model performance. Below we will look at the correlation between our two chosen features. In order to use both features we would be looking for a correlation close to zero.
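For reference, the correlation coefficient in question is Pearson's, which only measures linear association. The sketch below computes it from its definition and compares the result with the Pandas `Series.corr` method. The data is synthetic and the helper `pearson` is a hypothetical name. The second series is deliberately built as the absolute value of the first, to show that two features can be strongly related and still have a linear correlation near zero.

```python
import numpy as np
import pandas as pd


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


rng = np.random.default_rng(1)
rets = pd.Series(rng.normal(0.0, 1e-4, size=5_000))  # synthetic, symmetric around zero
abs_rets = rets.abs()                                # fully determined by rets

manual = pearson(rets, abs_rets)
pandas_value = rets.corr(abs_rets)
print(f"manual: {manual:.4f}, pandas: {pandas_value:.4f}")
```

Unlike this helper, `Series.corr` excludes rows where either value is missing, which matters for the leading `NaN` values produced by `pct_change()` and `rolling(30)`.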
The following code displays a scatter plot of the two columns: returns and realised vol. It also uses the Pandas `corr` [method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) to calculate the correlation between the two.
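```python
fig, ax = plt.subplots(1, 2, figsize=(16, 10), squeeze=False)
for idx, fxpair in enumerate(fx_pairs_str_ind.keys()):
    row = (idx // 2)
    col = (idx % 2)
    print(
        f"vol and rets correlation for {fxpair} is "
        f"{fx_pairs_str_ind[fxpair]['rets'].corr(fx_pairs_str_ind[fxpair]['realised_vol'])}"
    )
    sns.scatterplot(data=fx_pairs_str_ind[fxpair], x='rets', y='realised_vol', ax=ax[row][col])
    ax[row][col].set_title(fxpair)
fig.suptitle("Correlation between realised vol and returns")
plt.show()
```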
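```
vol and rets correlation for C:EURUSD is 0.008572524273933109
vol and rets correlation for C:MXNZAR is 0.018750753882674935
```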

FX pairs returns and realised vol correlation
Now that we know our returns series and realised vol have little correlation we can use them as separate features in our machine learning model. In the next article we will be building a Support Vector Regressor to determine how reliably we can predict the next value of realised volatility using the returns and realised volatility values from the previous time point. This will allow us to prepare a pipeline for our data, understand some of the model parameters and prepare for the final specification of the model.
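As a rough preview of that feature set, the sketch below pairs each realised volatility value with the returns and realised volatility from the previous row using `shift(1)`. The function name `make_lagged_features`, the `_lag1` suffix and the toy data are assumptions for illustration; the next article may structure this differently.

```python
import numpy as np
import pandas as pd


def make_lagged_features(df):
    """Build features from the previous row and a target from the current row."""
    features = df[["rets", "realised_vol"]].shift(1).add_suffix("_lag1")
    dataset = features.assign(target=df["realised_vol"])
    # Rows without a complete set of lagged values or a target are dropped
    return dataset.dropna()


# Toy frame with the same column names as the article's DataFrames
rng = np.random.default_rng(7)
toy = pd.DataFrame({"c": 1.07 + rng.normal(0.0, 1e-4, size=50).cumsum()})
toy["rets"] = toy["c"].pct_change()
toy["realised_vol"] = toy["rets"].rolling(5).std()  # short window to suit the toy data

dataset = make_lagged_features(toy)
print(dataset.head())
```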
## Related Articles
- [Creating a Returns Series with Polygon's Forex Data](https://www.quantstart.com/articles/creating-a-returns-series-with-polygons-forex-data/)
[](https://www.quantstart.com/quantcademy/?ref=art)
### [The Quantcademy](https://www.quantstart.com/quantcademy/?ref=art)
Join the Quantcademy membership portal that caters to the rapidly-growing retail quant trader community and learn how to increase your strategy profitability.
[Find Out More](https://www.quantstart.com/quantcademy/?ref=art)
[](https://www.quantstart.com/successful-algorithmic-trading-ebook/)
### [Successful Algorithmic Trading](https://www.quantstart.com/successful-algorithmic-trading-ebook/)
How to find new trading strategy ideas and objectively assess them for your portfolio using a Python-based backtesting engine.
[Find Out More](https://www.quantstart.com/successful-algorithmic-trading-ebook/)
[](https://www.quantstart.com/advanced-algorithmic-trading-ebook/)
### [Advanced Algorithmic Trading](https://www.quantstart.com/advanced-algorithmic-trading-ebook/)
How to implement advanced trading strategies using time series analysis, machine learning and Bayesian statistics with R and Python.
[Find Out More](https://www.quantstart.com/advanced-algorithmic-trading-ebook/)
- QuantStart
- [About](https://www.quantstart.com/about/)
- [Articles](https://www.quantstart.com/articles/)
- [Sitemap](https://www.quantstart.com/sitemap/)
- Products
- [Quantcademy](https://www.quantstart.com/quantcademy/)
- [QSTrader](https://www.quantstart.com/qstrader/)
- [Successful Algorithmic Trading](https://www.quantstart.com/successful-algorithmic-trading-ebook/)
- [Advanced Algorithmic Trading](https://www.quantstart.com/advanced-algorithmic-trading-ebook/)
- [C++ For Quantitative Finance](https://www.quantstart.com/cpp-for-quantitative-finance-ebook/)
- Legal
- [Privacy Policy](https://www.quantstart.com/privacy-policy/)
- [Terms & Conditions](https://www.quantstart.com/terms-and-conditions/)
- Social
- [Twitter](https://twitter.com/quantstart)
- [YouTube](https://www.youtube.com/channel/UCmVnnZ6Y2TrJtY1eQJN6kWA)
©2012-2026 QuarkGluon Ltd. All rights reserved. |
| Readable Markdown | In the previous [article](https://www.quantstart.com/articles/creating-a-returns-series-with-polygons-forex-data/) we wrote a Python function which utilised the Polygon API to extract a month of minutely data for both a major (EURUSD) and an exotic (MXNZAR) FX pair. We plotted the returns series and looked at some of the issues that can occur when working with this type of data. This article is part of a series in which we will be creating a machine learning model that uses realised volatility to predict market regime change. In this article we will learn how to calculate realised volatility. We will also conduct an exploratory data analysis to prepare the feature set for our machine learning model.
In order to follow along with the code in this tutorial you will need:
- Python 3.9
- Matplotlib 3.5
- Pandas 1.4
- Requests 2.27
- Seaborn 0.12.
We also recommend that you take a look at our [early career research series](https://www.quantstart.com/articles/creating-an-algorithmic-trading-prototyping-environment-with-jupyter-notebooks-and-plotly/) to create an algorithmic trading environment inside a Jupyter Notebook.
In order to calculate realised volatility we first need to obtain and format the data. In the previous article we created Python functions to contact the Polygon API and obtain a month of minutely data for EURUSD and MXNZAR. You can view the full article [here](https://www.quantstart.com/articles/creating-a-returns-series-with-polygons-forex-data/). Below you will find the code to obtain the data.
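```python
import json
import os  # used to read the API key from the environment

import matplotlib.pyplot as plt
import pandas as pd
import requests
import seaborn as sns

POLYGON_API_KEY = os.getenv("POLYGON_API_KEY")
HEADERS = {'Authorization': 'Bearer ' + POLYGON_API_KEY}
BASE_URL = 'https://api.polygon.io/v2/aggs/'


def get_fx_pairs_data(pg_tickers):
    # Request one month of one-minute aggregates for each ticker
    start = '2023-01-01'
    end = '2023-01-31'
    multiplier = '1'
    timespan = 'minute'
    fx_url = f"range/{multiplier}/{timespan}/{start}/{end}?adjusted=true&sort=asc&limit=50000"
    fx_pairs_dict = {}
    for pair in pg_tickers:
        response = requests.get(
            f"{BASE_URL}ticker/{pair}/{fx_url}",
            headers=HEADERS
        ).json()
        fx_pairs_dict[pair] = pd.DataFrame(response['results'])
    return fx_pairs_dict


def format_fx_pairs(fx_pairs_dict):
    # Convert the millisecond timestamps to datetimes and use them as the index
    for pair in fx_pairs_dict.keys():
        fx_pairs_dict[pair]['t'] = pd.to_datetime(fx_pairs_dict[pair]['t'], unit='ms')
        fx_pairs_dict[pair] = fx_pairs_dict[pair].set_index('t')
    return fx_pairs_dict


def create_str_index(fx_pairs_dict):
    # Replace the datetime index with a formatted string index
    fx_pairs_str_ind = {}
    for pair in fx_pairs_dict.keys():
        fx_pairs_str_ind[pair] = fx_pairs_dict[pair].set_index(
            fx_pairs_dict[pair].index.to_series().dt.strftime('%Y-%m-%d-%H-%M-%S')
        )
    return fx_pairs_str_ind


def create_returns_series(fx_pairs_str_ind):
    # Percentage change of the close price gives the returns series
    for pair in fx_pairs_str_ind.keys():
        fx_pairs_str_ind[pair]['rets'] = fx_pairs_str_ind[pair]['c'].pct_change()
    return fx_pairs_str_ind
```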
The functions can be called in a new cell, or in a main function if you are running the code outside a notebook environment. The code below stores the return value of each function in a variable.
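```python
pg_tickers = ["C:EURUSD", "C:MXNZAR"]
fx_pairs_dict = get_fx_pairs_data(pg_tickers)
formatted_fx_dict = format_fx_pairs(fx_pairs_dict)
fx_pairs_str_ind = create_str_index(formatted_fx_dict)
fx_returns_dict = create_returns_series(fx_pairs_str_ind)
```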
The FX data for each pair can be accessed as below:
```python
fx_returns_dict['C:EURUSD']
```

EURUSD formatted returns series
Realised volatility is a measure of risk. It measures the variability in the returns of an investment over a defined period of time. It is calculated as the square root of the sum of the squared returns over that period. In contrast to [implied volatility](https://www.quantstart.com/articles/Implied-Volatility-in-C-using-Template-Functions-and-Interval-Bisection/), realised volatility shows the actual change in historical prices, rather than a prediction of future volatility. However, it is possible to use the data to forecast the volatility in returns.
So why is this useful? If you can reliably predict volatility for a window of future time you can make an assessment of the market regime. This can help you determine the types of trading strategies you may want to use. For example in periods of low volatility you might look at trend following strategies on stocks. In a high volatility environment you might want to consider mean reverting strategies or options trading strategies such as [Straddle or Strangle](https://www.investopedia.com/articles/active-trading/040515/how-profit-volatility.asp). By quantifying the level of risk present in your trading universe you can make better trading decisions.
Realised volatility is a way of understanding the degree of price movement for a given period. It is calculated as follows:
1. Collect the price data and calculate the returns
2. Square the returns to give more weight to larger changes
3. Calculate the average of the squared returns by adding them up and dividing by the number of periods (this is also known as the variance)
4. Take the square root of the variance (also known as the standard deviation). This is your measure of realised volatility
Ultimately we will be training a Support Vector Regressor to help us identify market regime change. In order to do this we will calculate realised volatility across a particular window of time. The window you select will depend upon the frequency of your data and the liquidity of the asset. As we are considering both a major and an exotic FX pair we will calculate our realised volatility over 30 data points. The Pandas library contains a `rolling()` method that can be combined with any statistical aggregator using [method chaining](https://practicaldatascience.co.uk/data-science/how-to-use-method-chaining-in-pandas). Here we use `rolling()` with `std()` to calculate the rolling standard deviation. The rolling window size can be set to a fixed number of observations or to a time offset, such as 30 minutes.
## Calculating Realised Volatility
The code below will create a new column in each of the fx pairs DataFrames which will contain the realised vol calculated as the rolling standard deviation of the last 30 data points. This will equate approximately to the last 30 minutes of data, with some variation depending on the trading frequency of the FX pair. Let's take a look at the realised volatility on a chart.
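```python
def create_realised_vol(pairs_rets_dict):
    # Rolling standard deviation of returns over the last 30 observations
    for pair in pairs_rets_dict:
        pairs_rets_dict[pair]['realised_vol'] = pairs_rets_dict[pair]['rets'].rolling(30).std()
    return pairs_rets_dict
```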

Realised volatility for EURUSD and MXNZAR
As you can see, the exotic pair MXNZAR has fewer data points than EURUSD. As it is thinly traded, this is to be expected. In fact, if we look at the length of both our DataFrames we can see that EURUSD has 31745 rows, whereas MXNZAR has 16865 rows, almost half as many. In the previous article we discussed the different options available to us for handling the missing data points in the MXNZAR pair.
## Exploring the Features of the Data
Before we begin to construct a machine learning model it is a good idea to have a look at the data and understand how its features could affect model performance. Why should you do this? Why can't you just get going? Well, firstly the distribution of the data should influence your choice of model. For example, if the relationship in your data is linear, a linear regression model or a linear kernel in an SVM would be appropriate. If the relationship is non-linear, a polynomial regression or a radial basis function kernel would be more appropriate. You might also want to consider [transforming or scaling](https://scikit-learn.org/stable/modules/preprocessing.html) your data to make it more compatible with your chosen model. Perhaps your data contains outliers or missing data that you need to be aware of. There could be correlation (or linear association) between the features you have selected to train your model. There are many reasons why it is important to get to know your data before you start training a model. In fact it's estimated that between 70-80% of an engineer's time is spent toiling with data. Let's take a look at a simple exploratory data analysis:
First we will look at the distribution of the returns for both our FX pairs. We use the Seaborn function [kdeplot()](https://seaborn.pydata.org/generated/seaborn.kdeplot.html) to visualise the distributions.

Returns and realised Volatility distributions
Firstly, it's important to note that the scale on the x axis for EURUSD and MXNZAR is different. The range of returns is larger for MXNZAR in both the positive and negative direction when compared to EURUSD. This is reflected in the realised volatility. We can also see that all of the distributions have positive excess kurtosis when compared to a normal distribution. The realised volatility also has positive skew. This information will be helpful when we are trying to choose or improve any machine learning models that we might want to use. Finally, in the exotic FX pair MXNZAR, we can see two small humps just before -0.01 and 0.01. This indicates the presence of spikes in our returns data.
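Skew and kurtosis can also be checked numerically rather than read off a chart. The sketch below uses the Pandas `Series.skew()` and `Series.kurt()` methods, the latter of which reports excess kurtosis, so a normal distribution gives values near zero for both. The helper `shape_stats` and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def shape_stats(series):
    """Sample skewness and excess kurtosis of a series, ignoring NaNs."""
    clean = series.dropna()
    return {"skew": clean.skew(), "excess_kurtosis": clean.kurt()}


# Synthetic fat-tailed returns stand in for the real series (assumption)
rng = np.random.default_rng(7)
returns = pd.Series(rng.standard_t(df=4, size=5000) * 1e-4, name="returns")
realised_vol = returns.rolling(30).std()

print("returns:     ", shape_stats(returns))
print("realised vol:", shape_stats(realised_vol))
```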
To look into these in more detail we can use a [violin plot](https://seaborn.pydata.org/generated/seaborn.violinplot.html). This will allow us to see how many instances of these spikes there are in our data and how far from the bulk of the distribution they are. The following code will create the violin plot.
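A sketch of a violin plot per pair, again on synthetic data. The injected spikes at ±0.01, the `returns` column name and the 0.008 threshold used to count them are all assumptions chosen to mimic the humps described above.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)

# Synthetic returns (assumption); MXNZAR gets five injected spikes at +/-0.01
eurusd = pd.DataFrame({"returns": rng.normal(0.0, 1e-4, 1000)})
mxnzar = pd.DataFrame({"returns": rng.normal(0.0, 1e-3, 1000)})
mxnzar.loc[[100, 300, 500, 700, 900], "returns"] = [0.01, -0.01, 0.01, -0.01, 0.01]
pairs = {"EURUSD": eurusd, "MXNZAR": mxnzar}

# One violin per subplot so each pair keeps its own y-axis scale
fig, axes = plt.subplots(1, len(pairs), figsize=(10, 4))
for ax, (name, df) in zip(axes, pairs.items()):
    sns.violinplot(data=df, y="returns", ax=ax)
    ax.set_title(f"{name} returns")
fig.tight_layout()

# Count how many returns sit far from the bulk of the distribution
threshold = 0.008
n_spikes = int((mxnzar["returns"].abs() > threshold).sum())
print(f"MXNZAR returns beyond +/-{threshold}: {n_spikes}")
```

The long, thin tails of the violin show how far the spikes sit from the bulk, while the count gives the number of instances directly.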

Violin plot of returns
As you can see, there are more points further from the bulk of the distribution in the MXNZAR FX pair. This means that there are more occasions where there is a higher or lower return when compared to EURUSD. This information alone could be useful when you are considering trading strategies.
As we are planning to put our data into a machine learning model we also need to think about correlation. We want to use the realised volatility and the returns as features in a training set for a Support Vector Regressor. We need to ensure that these two features aren't correlated, otherwise we would be giving the machine learning model the same information twice. The model would become overspecified which could lead to poor model performance. Below we will look at the correlation between our two chosen features. In order to use both features we would be looking for a correlation close to zero.
The following function will display a scatter plot of the two columns: returns and realised vol. It also uses the Pandas `pd.DataFrame.corr` [method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) to calculate the correlation between the two.
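A minimal sketch of such a function, assuming columns named `returns` and `realised_vol`. The function name `plot_correlation` and the synthetic data are hypothetical; only `DataFrame.corr` (Pearson correlation by default) is taken from the text.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_correlation(df, name, x="returns", y="realised_vol", ax=None):
    """Scatter two columns against each other and return their correlation."""
    data = df[[x, y]].dropna()  # rolling windows leave NaNs at the start
    corr = data.corr().loc[x, y]  # Pearson correlation by default
    if ax is None:
        _, ax = plt.subplots()
    ax.scatter(data[x], data[y], s=4, alpha=0.5)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title(f"{name}: correlation = {corr:.3f}")
    return corr


# Synthetic returns stand in for the real series (assumption)
rng = np.random.default_rng(11)
eurusd = pd.DataFrame({"returns": rng.normal(0.0, 1e-4, 2000)})
eurusd["realised_vol"] = eurusd["returns"].rolling(30).std()

corr = plot_correlation(eurusd, "EURUSD")
print(f"EURUSD correlation: {corr:.3f}")
```

Returning the coefficient, rather than only printing it, makes it easy to compare pairs or to apply a cut-off before deciding whether both features should go into the model.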

FX pairs returns and realised vol correlation
Now that we know our returns series and realised vol have little correlation we can use them as separate features in our machine learning model. In the next article we will be building a Support Vector Regressor to determine how reliably we can predict the next value of realised volatility using the returns and realised volatility values from the previous time point. This will allow us to prepare a pipeline for our data, understand some of the model parameters and prepare for the final specification of the model.
## Related Articles
- [Creating a Returns Series with Polygon's Forex Data](https://www.quantstart.com/articles/creating-a-returns-series-with-polygons-forex-data/) |
| Shard | 57 (laksa) |
| Root Hash | 15201169621511058857 |
| Unparsed URL | com,quantstart!www,/articles/calculating-realised-volatility-with-polygon-forex-data/ s443 |