In this notebook, we will be conducting a time series analysis on the Ethereum price. As of late, cryptocurrencies have been incredibly popular in the media and in the financial community. Let’s see if we can build a good time series model.
ether = read.csv(file.choose(), header=T)
attach(ether)
head(ether)
dim(ether)
[1] 866 3
colnames(ether)
[1] "Date.UTC." "UnixTimeStamp" "Value"
Let’s plot date against the price (Note: this will create a funky looking plot. But we will do it for the sake of an example)
plot(Date.UTC., Value)
Now, in order to take advantage of the data variable we will need to transform. If you are new to time series analysis, it is important that you transform your date explanatory variable.
Now, let’s do this the correct way.
time = 1:866
plot(time, Value)
From this plot, we can see that there isn’t really a linear relationship of Value with time. Let’s create a time squared variable and build a simple linear regression model (one without the time squared variable, and one with it).
m1 = lm(Value~time)
summary(m1)
Call:
lm(formula = Value ~ time)
Residuals:
Min 1Q Median 3Q Max
-127.37 -72.99 4.16 60.79 443.83
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -92.46162 5.66996 -16.31 <2e-16 ***
time 0.40199 0.01133 35.48 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 83.36 on 864 degrees of freedom
Multiple R-squared: 0.593, Adjusted R-squared: 0.5925
F-statistic: 1259 on 1 and 864 DF, p-value: < 2.2e-16
It appears that we have a significant model where the F p-value=2.2e-16 < alpha=.05. Let’s see if the squared term will improve the R2 and R2a value
time_sq = time ** 2
m2 = lm(Value~time+time_sq)
summary(m2)
Call:
lm(formula = Value ~ time + time_sq)
Residuals:
Min 1Q Median 3Q Max
-87.329 -36.385 2.573 30.218 295.398
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.804e+01 5.049e+00 11.50 <2e-16 ***
time -6.384e-01 2.689e-02 -23.74 <2e-16 ***
time_sq 1.200e-03 3.004e-05 39.95 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 49.41 on 863 degrees of freedom
Multiple R-squared: 0.8571, Adjusted R-squared: 0.8568
F-statistic: 2589 on 2 and 863 DF, p-value: < 2.2e-16
The residual standard error decreased by half, and the R2 and R2a increased greatly. We now have a model that explains 85.71% of the variation amongst Value, which represents the price of one ether.
Let’s see if we can perform a random walk and conduct some forecasting
new_value = ts(Value)
plot(new_value, type="b")
So it appears that Value isn’t really a random walk. It only becomes a random walk after time 600.
ether[600,1]
[1] 3/20/2017
866 Levels: 1/1/2016 1/1/2017 1/10/2016 1/10/2017 1/11/2016 ... 9/9/2017
If we look at the 600th observation, we find the date to be at 3/20/2017. It makes sense why the data appears to look more like a random walk because this is when the value of ethereum began to become more volatile as more people started to become a part of ethereum (mining, create DApps, etc.).
Let’s make another plot after time 600 to see if it looks more like a random walk.
new_value = Value[600:length(Value)]
time_value = ts(new_value)
plot(time_value, type="b")
For this random walk let’s conduct some forecasting. We will first see which time lag contributes the most to forecasting the value of ethereum.
ether[nrow(ether), ]
tail(new_value)
[1] 422.48 421.15 513.29 656.52 699.09 693.58
new_value[length(new_value)]
[1] 693.58
From this output, we find that the last day recorded is on 12/14/2017. The last value of Value is $693.58. Let’s conduct a forecast to today (12/18/2017), and a week from the 14th.
Now this is incredibly off the mark. The price for ethereum today is $789. The reason why the price has increased, is because there has been so much activity within the past week.
diff = new_value[-1]- new_value[1:1-length(new_value)]
c_bar = mean(diff)
four_days = new_value[length(new_value)] + 4*c_bar
seven_days = new_value[length(new_value)] + 7*c_bar
four_days
[1] 703.4534
seven_days
[1] 710.8584
The random walk forecasts that in 4 days the price of ethereum will be $703.4534 and in a week it will be $710.8584.
acf(time_value, lag.max=20,plot=F)
pacf(time_value, lag.max=20)
From the output from the autocorrelation, and the plot from the partial autocorrelation, we find that lag1 sufficiently explains what is occuring at the subsequent timestamps. Let’s create an AR(1) model using lag1.
yt = new_value[-1]
ylag1= new_value[-length(new_value)]
plot(ylag1, yt)
We find a very strong linear relationship between the current time and the previous timestamp. Let us build the model.
# AR(1)
m2 = lm(yt~ylag1)
summary(m2)
Call:
lm(formula = yt ~ ylag1)
Residuals:
Min 1Q Median 3Q Max
-55.390 -7.473 -1.638 5.452 138.648
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.492190 2.599781 0.189 0.85
ylag1 1.007967 0.009481 106.316 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 18.91 on 264 degrees of freedom
Multiple R-squared: 0.9772, Adjusted R-squared: 0.9771
F-statistic: 1.13e+04 on 1 and 264 DF, p-value: < 2.2e-16
So we have an incredible model in which the R2 value=.9772, and the R2a=.9971. The residual standard error decreased compared to model one, and we have a very significant model. In the next week or so, we will pull in more data to verify, by predicting on, the current value of ethereum. This is very exciting!
Data Science Rules!
Until next time,
Joel Afriyie