Self-similarity is a characteristic feature of network data. Traditional time series models, such as the autoregressive moving average model ARMA(p, q), are not appropriate for network data: they have only finitely many parameters and short-range correlation structure, while network data usually exhibit long-range dependence. A self-similar process is therefore used to describe the structure of network data. The following sections give the definitions of self-similar processes and some of their properties, and describe important methods for detecting and exploiting the self-similarity of network data, such as estimating the Hurst parameter and choosing an appropriate model.
Definition
Suppose $X = \{X_t,\ t \in \mathbb{Z}\}$ is a weakly stationary (second-order stationary) process with mean $\mu$, variance $\sigma^2$, and autocorrelation function $\gamma(k)$. Assume that the autocorrelation function has the form $\gamma(k) \sim k^{-\beta} L_1(k)$ as $k \to \infty$, where $0 < \beta < 1$ and $L_1$ is a slowly varying function at infinity, that is, $\lim_{t \to \infty} L_1(tx)/L_1(t) = 1$ for all $x > 0$. For example, $L_1(t) = \log t$ and $L_1(t) = c$ (a constant) are slowly varying functions.
Let $X^{(m)} = \{X_k^{(m)},\ k = 1, 2, 3, \ldots\}$, where $X_k^{(m)} = \frac{1}{m} \sum_{i=(k-1)m+1}^{km} X_i$, denote the aggregated series over non-overlapping blocks of size $m$, for each positive integer $m$.
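The aggregated series is straightforward to compute. Below is a minimal sketch using NumPy; the function name `aggregate` is chosen here for illustration:

```python
import numpy as np

def aggregate(x, m):
    """Block means X^{(m)}_k = (1/m) * sum of X_i over the k-th block of size m."""
    n = (len(x) // m) * m              # drop any incomplete tail block
    return x[:n].reshape(-1, m).mean(axis=1)

# Example: blocks (1,2,3,4) and (5,6,7,8) with m = 4
x = np.arange(1.0, 9.0)
xm = aggregate(x, 4)                   # -> array([2.5, 6.5])
```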
Exactly self-similar process
$X$ is called an exactly self-similar process if there exists a self-similar (Hurst) parameter $H \in (0, 1)$ such that $m^{1-H} X^{(m)}$ has the same distribution as $X$ for every $m$. An example of an exactly self-similar process with $H \in (1/2, 1)$ is fractional Gaussian noise (FGN) with parameter $H$.
Definition: Fractional Gaussian Noise (FGN)
$X_t = B_H(t+1) - B_H(t)$, $t \in \mathbb{Z}$, is called fractional Gaussian noise, where $B_H(t)$ is a fractional Brownian motion with Hurst parameter $H$.
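FGN can be simulated exactly by circulant embedding of its autocovariance sequence (the Davies–Harte method). A sketch with NumPy, where `fgn_autocov` and `sample_fgn` are illustrative names:

```python
import numpy as np

def fgn_autocov(k, H):
    """Autocovariance of unit-variance FGN:
    gamma(k) = 0.5 * (|k+1|^{2H} - 2|k|^{2H} + |k-1|^{2H})."""
    k = np.abs(np.asarray(k, dtype=float))
    return 0.5 * ((k + 1) ** (2 * H) - 2 * k ** (2 * H) + np.abs(k - 1) ** (2 * H))

def sample_fgn(n, H, rng):
    """Draw an exact FGN sample of length n by circulant embedding (Davies-Harte)."""
    g = fgn_autocov(np.arange(n + 1), H)
    row = np.concatenate([g, g[-2:0:-1]])        # first row of the circulant matrix
    # eigenvalues of the circulant; nonnegative for FGN, clip rounding error
    eigs = np.maximum(np.fft.fft(row).real, 0.0)
    m = len(row)
    z = rng.standard_normal(m) + 1j * rng.standard_normal(m)
    return np.fft.fft(np.sqrt(eigs) * z).real[:n] / np.sqrt(m)

rng = np.random.default_rng(0)
x = sample_fgn(4096, H=0.8, rng=rng)             # unit variance, H = 0.8
```

The sample should have (approximately) zero mean and unit variance, with the slowly decaying correlations characteristic of $H = 0.8$.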
Exactly second-order self-similar process
$X$ is called an exactly second-order self-similar process if there exists a self-similar parameter $H$ such that $m^{1-H} X^{(m)}$ has the same variance and autocorrelation function as $X$ for every $m$.
Asymptotically second-order self-similar process
$X$ is called an asymptotically second-order self-similar process with self-similar parameter $H$ if, for every $k \geq 1$, the autocorrelations of the aggregated series satisfy $\gamma^{(m)}(k) \to \frac{1}{2}\left[(k+1)^{2H} - 2k^{2H} + (k-1)^{2H}\right]$ as $m \to \infty$.
Properties of Self-Similar Processes
Long-Range Dependence (LRD)
Suppose $X$ is a weakly stationary (second-order stationary) process with mean $\mu$ and variance $\sigma^2$. The autocorrelation function (ACF) at lag $t$ is given by
$$\gamma(t) = \frac{\mathrm{cov}(X(h), X(h+t))}{\sigma^2} = \frac{E[(X(h)-\mu)(X(h+t)-\mu)]}{\sigma^2}.$$
Definition:
A weakly stationary process is said to have long-range dependence if its autocorrelations are non-summable, i.e. $\sum_{k} \gamma(k) = \infty$.
A process which satisfies $\gamma(k) \sim k^{-\beta} L_1(k)$ as $k \to \infty$, with $0 < \beta < 1$, is said to have long-range dependence. The spectral density function of a long-range dependent process follows a power law near the origin. Equivalently to the condition on $\gamma(k)$, $X$ has long-range dependence if the spectral density function of the autocorrelation function, $f(\lambda)$, has the form $f(\lambda) \sim \lambda^{-(1-\beta)} L_2(\lambda)$ as $\lambda \to 0$, where $0 < \beta < 1$ and $L_2$ is slowly varying at 0.
Slowly decaying variances
When the autocorrelation function of a self-similar process satisfies $\gamma(k) \sim k^{-\beta} L_1(k)$ as $k \to \infty$, the variance of the aggregated series satisfies $\mathrm{Var}(X^{(m)}) \sim a m^{-\beta}$ as $m \to \infty$, where $a$ is a finite positive constant independent of $m$, and $0 < \beta < 1$.
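For FGN the variance relation holds exactly, with $\beta = 2 - 2H$: the variance of a block mean can be checked directly from the autocovariances. A small NumPy sketch (function names are illustrative):

```python
import numpy as np

def fgn_acv(k, H):
    """Autocovariance of unit-variance fractional Gaussian noise."""
    k = np.abs(np.asarray(k, dtype=float))
    return 0.5 * ((k + 1) ** (2 * H) - 2 * k ** (2 * H) + np.abs(k - 1) ** (2 * H))

def block_mean_var(H, m):
    """Var(X^{(m)}) = (1/m^2) * sum over i, j of gamma(i - j), one block of size m."""
    i = np.arange(m)
    return fgn_acv(i[:, None] - i[None, :], H).sum() / m ** 2

# For FGN, Var(X^{(m)}) = m^{2H-2} exactly, i.e. beta = 2 - 2H is in (0, 1)
H, m = 0.8, 50
v = block_mean_var(H, m)        # matches m**(2*H - 2) up to rounding
```

The double sum telescopes to $m^{2H}/m^2$ because the partial sums of FGN are increments of fractional Brownian motion.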
Testing for self-similarity
R/S analysis
Assume that the underlying process $X$ is fractional Gaussian noise. Consider the partial-sum series $Y_t = \sum_{i=1}^{t} X_i$, and let $\bar{X}_n = Y_n / n$. The sample variance of $X$ is
$$S^2(n) = \frac{1}{n}\sum_{t=1}^{n} \left(X_t - \bar{X}_n\right)^2.$$
Definition: R/S statistic
$$\frac{R}{S}(n) = \frac{1}{S(n)}\left[\max_{0 \leq t \leq n}\left(Y_t - \frac{t}{n}Y_n\right) - \min_{0 \leq t \leq n}\left(Y_t - \frac{t}{n}Y_n\right)\right]$$
If $X$ is FGN, then $E\left[\frac{R}{S}(n)\right] \sim C_H n^H$ as $n \to \infty$, where $C_H$ is a finite positive constant independent of $n$.
Consider fitting the regression model $\log\left[\frac{R}{S}(n)\right] = \log C_H + H \log n + \epsilon_n$, where $\epsilon_n$ is the error term.
In particular, for a time series of length $N$, divide the data into $K$ non-overlapping groups, each of size $n = N/K$, and compute $\frac{R}{S}(n)$ for each group. Thus for each $n$ we have $K$ pairs of data $\left(\log n, \log\frac{R}{S}(n)\right)$. There are $K$ points for each $n$, so we can fit a regression model to estimate $H$ more accurately. If the slope of the regression line is between 0.5 and 1, the process is self-similar.
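The procedure above can be sketched in NumPy; the names `rs_statistic` and `hurst_rs` are chosen here for illustration, and the known small-sample bias of R/S is ignored:

```python
import numpy as np

def rs_statistic(x):
    """R/S statistic: range of the adjusted partial sums Y_t - (t/n) Y_n over
    0 <= t <= n, divided by the sample standard deviation S(n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    y = np.cumsum(x)                          # partial sums Y_1, ..., Y_n
    dev = y - (np.arange(1, n + 1) / n) * y[-1]
    dev = np.concatenate([[0.0], dev])        # include t = 0, where the deviation is 0
    return (dev.max() - dev.min()) / x.std()

def hurst_rs(x, block_sizes):
    """Estimate H as the slope of log (R/S)(n) against log n."""
    x = np.asarray(x, dtype=float)
    log_n, log_rs = [], []
    for n in block_sizes:
        k = len(x) // n                       # number of non-overlapping groups
        rs = [rs_statistic(x[i * n:(i + 1) * n]) for i in range(k)]
        log_n.append(np.log(n))
        log_rs.append(np.log(np.mean(rs)))
    slope, _ = np.polyfit(log_n, log_rs, 1)
    return slope

rng = np.random.default_rng(1)
h = hurst_rs(rng.standard_normal(8192), block_sizes=[16, 32, 64, 128, 256])
# White noise has H = 0.5, so the estimate should land in that neighborhood
```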
[Figure: R/S pox plot]
Variance-time plot
The variance of the sample mean is given by $\mathrm{Var}(X^{(m)}) \sim a m^{-\beta}$ as $m \to \infty$.
To estimate $H$, calculate the sample means $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_{m_k}$ for the $m_k = \lfloor N/k \rfloor$ sub-series of length $k$. The overall mean is given by $\bar{X} = \frac{1}{m_k}\sum_{j=1}^{m_k} \bar{X}_j$, and the sample variance by $s^2(k) = \frac{1}{m_k - 1}\sum_{j=1}^{m_k}\left(\bar{X}_j - \bar{X}\right)^2$.
The variance-time plot is obtained by plotting $\log s^2(k)$ against $\log k$, and we can fit a simple least-squares line through the resulting points in the plane, ignoring the small values of $k$.
For large values of $k$, the points in the plot are expected to be scattered around a straight line with negative slope $-\beta$. For short-range dependence or independence among the observations, the slope of the straight line equals $-1$.
Self-similarity can be inferred when the estimated slope lies asymptotically between $-1$ and $0$, and an estimate of the degree of self-similarity is given by $\hat{H} = 1 - \hat{\beta}/2$.
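A minimal NumPy sketch of the variance-time estimator (names illustrative), checked here against white noise, for which the slope should be close to $-1$:

```python
import numpy as np

def variance_time_slope(x, block_sizes):
    """Slope of log Var(X^{(m)}) versus log m; the estimate of H is 1 + slope/2."""
    x = np.asarray(x, dtype=float)
    log_m, log_var = [], []
    for m in block_sizes:
        k = len(x) // m
        means = x[:k * m].reshape(k, m).mean(axis=1)   # sub-series (block) means
        log_m.append(np.log(m))
        log_var.append(np.log(means.var()))
    slope, _ = np.polyfit(log_m, log_var, 1)
    return slope

rng = np.random.default_rng(2)
slope = variance_time_slope(rng.standard_normal(16384), [4, 8, 16, 32, 64, 128])
h_hat = 1 + slope / 2      # slope = -beta, so H = 1 - beta/2 = 1 + slope/2
```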
[Figure: variance-time plot]
Periodogram-based analysis
Whittle's approximate maximum likelihood estimator (MLE) estimates the Hurst parameter via the spectral density of $X$. It is not only a tool for visualizing the Hurst parameter, but also a method for statistical inference about the parameters, via the asymptotic properties of the MLE. In particular, suppose $X$ follows a Gaussian process with spectral density $f(\lambda; \theta)$, $\lambda \in [-\pi, \pi]$, where $\theta$ is the vector of unknown parameters (including $H$). For comparison, a short-range autoregressive (AR) time series model, $X_t = \sum_{i=1}^{p} \phi_i X_{t-i} + \epsilon_t$ with $\epsilon_t \sim N(0, \sigma_\epsilon^2)$, has spectral density $f(\lambda; \phi) = \frac{\sigma_\epsilon^2}{2\pi}\left|1 - \sum_{j=1}^{p}\phi_j e^{-ij\lambda}\right|^{-2}$.
Thus, Whittle's estimator $\hat{\theta}$ of $\theta$ minimizes the function
$$Q(\theta) = \int_{-\pi}^{\pi} \frac{I(\lambda)}{f(\lambda; \theta)}\, d\lambda,$$
where $I(\lambda)$ denotes the periodogram of $X$,
$$I(\lambda) = \frac{1}{2\pi n}\left|\sum_{t=1}^{n} X_t e^{it\lambda}\right|^2,$$
and $\lambda \in [-\pi, \pi]$. These integrals can be approximated by Riemann sums.
Then $\hat{\theta}$ asymptotically follows a normal distribution if $X$ can be expressed in the form of an infinite moving average model.
To estimate $H$, one first calculates the periodogram. Since $I(\lambda)$ is an estimator of the spectral density, a series with long-range dependence should have a periodogram proportional to $|\lambda|^{1-2H}$ close to the origin. The periodogram plot is obtained by plotting $\log I(\lambda)$ against $\log \lambda$.
Fitting a regression of $\log I(\lambda)$ on $\log \lambda$ should then give a slope of $1 - 2H$. The slope of the fitted straight line is thus an estimate of $1 - 2H$, from which the estimate $\hat{H}$ is obtained.
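The periodogram regression can be sketched as follows (NumPy; the name `hurst_periodogram` and the 10% low-frequency cutoff are illustrative choices):

```python
import numpy as np

def hurst_periodogram(x, frac=0.1):
    """Regress log I(w) on log w over the lowest `frac` of the Fourier
    frequencies; the slope estimates 1 - 2H, so H_hat = (1 - slope) / 2."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    w = 2 * np.pi * np.arange(1, n // 2 + 1) / n          # Fourier frequencies
    pgram = np.abs(np.fft.fft(x - x.mean())[1:n // 2 + 1]) ** 2 / (2 * np.pi * n)
    m = max(2, int(frac * len(w)))                        # keep frequencies near 0
    slope, _ = np.polyfit(np.log(w[:m]), np.log(pgram[:m]), 1)
    return (1 - slope) / 2

rng = np.random.default_rng(3)
h = hurst_periodogram(rng.standard_normal(8192))
# White noise has a flat spectrum (slope near 0), so the estimate should be near 0.5
```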
[Figure: periodogram plot]
Note:
There are two common problems when applying the periodogram method. First, the data may not follow a Gaussian distribution; transforming the data can address this. Second, the sample spectrum may deviate from the assumed spectral density; an aggregation method is suggested to solve this problem. If $X$ is a Gaussian process and the spectral density function of $X$ satisfies $f(\lambda) \sim |\lambda|^{1-2H}$ as $\lambda \to 0$, then the aggregated series $m^{1-H} X^{(m)}$ converges in distribution to FGN as $m \to \infty$.
FARIMA modeling
Time series methodology is helpful in network data analysis, especially when the data are non-stationary; in that case a fractional autoregressive integrated moving average (FARIMA) model can be applied. It resembles the traditional ARIMA(p, d, q) model; the only difference is that the differencing parameter $d$ may take fractional values between $-0.5$ and $0.5$. When $d \in (0, 0.5)$, the data have long-range dependence, and the Hurst parameter equals $d + 1/2$. The advantage of this method is that it captures short-range dependence via the ARMA(p, q) part and the LRD property at the same time.
Modeling procedure for FARIMA:
- Step 1: Consider the series $\{X_t\}$ and assume it follows a FARIMA(p, d, q) model, $\phi(B)(1-B)^d X_t = \theta(B)\epsilon_t$, such that $d \in (-0.5, 0.5)$.
- Step 2: Use R/S analysis (or another estimator) to estimate the Hurst parameter $\hat{H}$; the estimate of $d$ is then obtained by $\hat{d} = \hat{H} - 1/2$.
- Step 3: Take $Y_t = (1-B)^{\hat{d}} X_t$, where $(1-B)^d = \sum_{k=0}^{\infty}\binom{d}{k}(-B)^k$. Here $B$ is the backshift operator, that is, $B X_t = X_{t-1}$.
- Step 4: Fit an ARMA(p, q) model to the transformed data $\{Y_t\}$.
Thus, a FARIMA(p, d, q) model is obtained in the form $\phi(B)(1-B)^d X_t = \theta(B)\epsilon_t$.

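The fractional differencing in Step 3 can be sketched by expanding $(1-B)^d$ with the binomial recursion $w_0 = 1$, $w_k = w_{k-1}(k-1-d)/k$ (a truncated expansion; function names are illustrative):

```python
import numpy as np

def frac_diff_weights(d, n):
    """Coefficients of (1 - B)^d: w_0 = 1, w_k = w_{k-1} * (k - 1 - d) / k."""
    w = np.empty(n)
    w[0] = 1.0
    for k in range(1, n):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w

def frac_diff(x, d):
    """Apply (1 - B)^d to a series, truncating the expansion at the sample start."""
    x = np.asarray(x, dtype=float)
    w = frac_diff_weights(d, len(x))
    return np.array([np.dot(w[:t + 1], x[t::-1]) for t in range(len(x))])

# Sanity check: d = 1 reduces to ordinary first differencing (x_0 kept as-is)
y = frac_diff(np.array([1.0, 2.0, 4.0, 7.0]), d=1.0)   # -> [1., 1., 2., 3.]
```

For $d = \hat{H} - 1/2$ the weights decay slowly, so in practice the truncation length matters; the ARMA(p, q) fit of Step 4 is then applied to the resulting series.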
References
- P. Whittle, "Estimation and information in stationary time series", Ark. Mat. 2, 423-434, 1953.
- K. Park, W. Willinger, Self-Similar Network Traffic and Performance Evaluation, Wiley, 2000.
- W. E. Leland, W. Willinger, M. S. Taqqu, D. V. Wilson, "On the self-similar nature of Ethernet traffic", ACM SIGCOMM Computer Communication Review 25, 202-213, 1995.
- W. Willinger, M. S. Taqqu, W. E. Leland, D. V. Wilson, "Self-similarity in high-speed packet traffic: analysis and modeling of Ethernet traffic measurements", Statistical Science 10, 67-85, 1995.