407 lines
18 KiB
TeX
Executable file
407 lines
18 KiB
TeX
Executable file
\documentclass{article}
|
|
%\usepackage{times}
|
|
|
|
|
|
%\usepackage{amsthm}
|
|
\usepackage{algorithm} % for pseudo-codes
|
|
%\usepackage{enumerate}
|
|
\usepackage{algpseudocode}
|
|
%\usepackage{DotArrow}
|
|
%\usepackage{subfig}
|
|
%\usepackage{amsmath, environ}
|
|
%\usepackage{amssymb}
|
|
%\usepackage{amsthm}
|
|
%%\usepackage{bm}
|
|
%%\usepackage{mathtools}
|
|
\usepackage{graphicx}
|
|
%\usepackage{wrapfig}
|
|
%\usepackage{minibox}
|
|
%\usepackage{multirow}
|
|
%\usepackage{pifont} % \ding{}
|
|
%\usepackage[percent]{overpic}
|
|
%\usepackage{scrextend} % labeling environment
|
|
%
|
|
%\newtheorem{theorem}{Theorem}
|
|
%\newtheorem{definition}{Definition}
|
|
%\newtheorem{example}{Example}
|
|
%
|
|
%\renewcommand{\textfraction}{0.05}
|
|
%\renewcommand{\floatpagefraction}{0.75}
|
|
|
|
\newcommand{\centerimage}[5][-7pt]{ % [vskip] | graphics-opt | imgname | label | caption
|
|
\begin{figure}[!htb]%
|
|
\centering%
|
|
\vspace{#1}%
|
|
\includegraphics[#2]{#3}%
|
|
\caption{#5}\label{#4}\vspace{2mm}%
|
|
%\vspace{-3pt}\caption{#5}\label{#4}\vspace{-2pt}%
|
|
\end{figure}}
|
|
|
|
%\includegraphics[trim=1cm 2cm 3cm 4cm, clip=true]{example.pdf}
|
|
|
|
\begin{document}
|
|
|
|
\title{Assignment 2 - A.V. Reti Complesse 18/19}
|
|
|
|
\author{Francesco Galla`\\
|
|
\texttt{francesco.galla@edu.unito.it}
|
|
\and
|
|
Francesco Mecca\\
|
|
\texttt{francesco.mecca402@edu.unito.it}
|
|
}
|
|
|
|
|
|
\maketitle
|
|
|
|
|
|
%=========================================================================
|
|
%%%%%%% GitHub Model %%%%%%%
|
|
%=========================================================================
|
|
|
|
\section{Network Analysis} \label{sec:ex144}
|
|
|
|
%=========================================================================================================
|
|
\subsection{The High Energy Physics collaboration network} \label{ssec:description}
|
|
|
|
The network chosen for analysis is the \textit{High Energy Physics Collaboration Network}. The network covers scientific collaborations between authors papers submitted to High Energy Physics - Theory category. If an author $i$ co-authored a paper with author $j$, the graph contains a \textbf{undirected edge} $(i,j)$. If the paper is co-authored by $k$ authors this generates a \textbf{completely connected (sub)graph} on $k$ nodes.
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=7cm]{plots/network.png}
|
|
\caption{High Energy Physics Collaboration Network}
|
|
\label{rete}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
Network source: \textit{https://snap.stanford.edu/data/ca-HepTh.html}
|
|
|
|
%=========================================================================================================
|
|
\subsection{Network structure} \label{ssec:structure}
|
|
|
|
We start by describing the basic features of our network.
|
|
|
|
\begin{itemize}
|
|
\item Network type: undirected graph
|
|
\item Number of nodes: $9879$
|
|
\item Number of edges: $25999$
|
|
\item Number of \textbf{possible} edges (of the corresponding complete network): $48792381$
|
|
\item Density (number of edges / number of possible edges): 0.05\%
|
|
\end{itemize}
|
|
|
|
As it can be seen from the low density of the network, this model is far from being a complete graph.
|
|
|
|
%=========================================================================================================
|
|
\subsection{Graph filtering} \label{ssec:cleanup}
|
|
|
|
Prior to the analysis of the graph the absence of self loops and isolated nodes had to be verified and if such edges and nodes were found, they would have to be removed.
|
|
The collaboration network chosen contains few self loops and an irrelevant number of isolated nodes, which have been removed for completeness anyway.
|
|
|
|
\begin{itemize}
|
|
\item Number of self-loops: $25$
|
|
\item Number of isolates: $2$
|
|
\end{itemize}
|
|
|
|
The graph on which further analysis will be performed has the following characteristics.
|
|
|
|
\begin{itemize}
|
|
\item Number nodes, without isolates: $9877$
|
|
\item Number of edges, without self loops: $25974$
|
|
\end{itemize}
|
|
|
|
%=========================================================================================================
|
|
\subsection{Graph Analysis} \label{ssec:analysis}
|
|
|
|
%=========================================================================================================
|
|
\subsubsection{Properties} \label{ssec:analysis-quantities}
|
|
\begin{enumerate}
|
|
\item Distances
|
|
\item Degree
|
|
\item Clustering coefficient
|
|
\item Greatest connected component
|
|
\item Degree correlation
|
|
\item Communities
|
|
\item Centralities
|
|
\end{enumerate}
|
|
|
|
%=========================================================================================================
|
|
\subsubsection{Distances} \label{ssec:analysis-distances}
|
|
|
|
We computed the average distance between nodes in our model and show the distance distribution plot.
|
|
|
|
\begin{itemize}
|
|
\item Average distance: 5.94
|
|
\end{itemize}
|
|
|
|
The average distance of our network confirms the influence of the small world phenomenon, since the resulting value is close to the one expected from the \textit{six degrees of separation theory}.
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=8cm]{plots/DistDistributionGlobal.png}
|
|
\caption{Distance distribution}
|
|
\label{distDistrTotal}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
%=========================================================================================================
|
|
\subsubsection{Degree} \label{ssec:analysis-degree}
|
|
|
|
\begin{itemize}
|
|
\item Average degree: 5.26
|
|
\item Variance: 38.21
|
|
\item Standard deviation: 6.18
|
|
\end{itemize}
|
|
|
|
As it happens in most of the real collaboration networks, the degree distribution is highly asymmetric (or skewed): most of the nodes (the trivial many) have low degrees while a small but significant fraction of nodes (the vital few, called \textbf{hubs}) have an extraordinarily high degree. Since the probability of hubs, although low, is significant, the degree distribution, when plotted as a function of the degree, shows a long tail.
|
|
As noted from the degree distribution graph, the network follows a \textbf{power law} and therefore is a \textbf{scale free} network.
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=8cm]{plots/LinScaleDegreeDistr.png}
|
|
\caption{Linear degree distribution}
|
|
\label{degreeDistrLinear}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=8cm]{plots/LogScaleDegreeDistr.png}
|
|
\caption{Logarithmic degree distribution}
|
|
\label{degreeDistrLog}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
%=========================================================================================================
|
|
\subsubsection{Clustering Coefficient} \label{ssec:analysis-clustering-coefficient}
|
|
|
|
Clustering and transitivity coefficients are in line with values from other collaboration networks. As an example, the transitivity coefficient of the computer science collaboration network is around 25\%, as shown by a famous study by Mark Newman. The transitivity coefficient of our network shows a high percentage of \textit{closed triangles}. Along with the high value of the clustering coefficient, it confirms that our network is subject to the \textit{small world phenomenon}.
|
|
|
|
%% TODO MN Paper: http://www.pnas.org/content/98/2/404.full
|
|
|
|
\begin{itemize}
|
|
\item Average Clustering coefficient: 0.47
|
|
\item Transitivity coefficient: 28.40\%
|
|
\end{itemize}
|
|
|
|
%=========================================================================================================
|
|
\subsubsection{Degree correlation}
|
|
|
|
Degree Assortativity is the tendency of a node to connect with neighbors of similar degree, which can also be quantified as the average degree of the nearest neighbors of a node as a function of the node degree. Our social network displays a positive correlation with an increasing assortativity as the node degree increases, which shows the tendency of influential authors to connect with others of similar importance.
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=8cm]{plots/Assortativity.png}
|
|
\caption{Assortativity}
|
|
\label{assortativity}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
%=========================================================================================================
|
|
\subsubsection{Communities}
|
|
|
|
Community detection requires the application of graph partitioning algorithms: we chose the modularity-based Louvain algorithm for its efficiency and the 4-clique community algorithm since its expected results should be close to the ones produced by the Louvain algorithm.
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=10cm]{plots/gcomm2.png}
|
|
\caption{Communities}
|
|
\label{Gcomm}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
Using Louvain's algorithm we found $471$ communities in our network, shown in the plot. From the graphical representation it emerges the presence of different \textbf{hubs} connected to nodes belonging to the same community. Since the nodes of our graph are research authors, we speculate that these ones are the most influential ones, thereby connected with other influential people in the collaboration network. This evidence confirms the results of the Degree Correlation analysis.
|
|
|
|
To refine our search for communities in the graph, we show the result of the application of the 4-clique community, which identified $1177$ communities.
|
|
|
|
From the higher result obtained with the 4-clique algorithm we have a stronger confirmation of the tendency of nodes (authors) to form communities in our collaboration network.
|
|
|
|
\subsubsection{Centralities}
|
|
|
|
We analyze graph centrality considering the following properties:
|
|
|
|
\begin{itemize}
|
|
\item Degree centrality
|
|
\item Closeness centrality
|
|
\item Harmonic centrality
|
|
\item Eigenvector centrality
|
|
\item Betweenness centrality
|
|
\item PageRank
|
|
\item Hits Hubs and Authorities
|
|
\end{itemize}
|
|
|
|
Centralities parameters have been associated in tuples and the ones with the highes correlation degree have been selected. For the first 5 results we found very high correlation indexes: a evidence of the strong presence of hubs inside our network.
|
|
|
|
\begin{table}
|
|
\begin{center}
|
|
\caption{Highest correlation degree between centrality measures}
|
|
\label{tab:upd1}
|
|
\begin{tabular}{l|c|c|c|c|c|c}
|
|
\hline
|
|
\textbf{Centrality 1} & \textbf{Centrality 2} & \textbf{Correlation Degree} \\[2ex]
|
|
|
|
Hubs & Authorities & 1.000000 \\
|
|
Eigenvector & Authorities & 0.999998 \\
|
|
Hubs & Eigenvector & 0.999998 \\
|
|
Closeness & Harmonic Closeness & 0.999408 \\
|
|
Degree & PageRank & 0.889680 \\
|
|
\hline
|
|
\end{tabular}
|
|
\end{center}
|
|
\end{table}
|
|
|
|
\subsubsection{GCC}
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=8cm]{plots/gcc.png}
|
|
\caption{GCC of the collaboration network}
|
|
\label{gcc}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
The first step to compute the largest Connected Component is to partition our network into its connected components. We found 428 connected components, of which the largest can be classified as a Giant Connected Component and has the following properties:
|
|
|
|
\begin{itemize}
|
|
\item Number of nodes: 8638
|
|
\item Number of edges: 24806
|
|
\item Percentage of nodes: 87.46\%
|
|
\item Percentage of edges: 95.50\%
|
|
\item Density: 0.07\%
|
|
\end{itemize}
|
|
|
|
Our GCC contains a significant portion of nodes and edges of the whole network, while its density is higher than the one of the network, which can be explained from the fact that the percentage of nodes contained in the GCC is lower than the percentage of edges: $87.46\%$ against $95.50\%$.
|
|
|
|
\subsubsection{GCC Diameter - Radius - Center - Periphery}
|
|
|
|
To continue our analysis of the network GCC, we considered the following characteristics:
|
|
|
|
\begin{itemize}
|
|
\item GCC - Diameter: 18
|
|
\item GCC - Radius: 10
|
|
\end{itemize}
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=8cm]{plots/gcccenter.png}
|
|
\caption{Center - GCC}
|
|
\label{Gcomm}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=8cm]{plots/gccper.png}
|
|
\caption{Periphery - GCC}
|
|
\label{Gcomm}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
We used the eccentricity of every node of the GCC to compute radius and diameter for the component. We notice that the diameter (maximum eccentricity between all nodes) is close to the double of the radius (minimal eccentricity).
|
|
|
|
Having computed the eccentricity of each node, we used it to plot the center and the periphery of the GCC. The center is the set of nodes whose eccentricity is equal to the GCC radius, while the periphery is the set of nodes whose eccentricity is equal to the GCC diameter. We notice that our collaboration network display very distant nodes even in the GCC, which can easily be seen from the periphery plot~\ref{Gcomm}.
|
|
|
|
|
|
%=========================================================================================================
|
|
% RANDOM MODEL
|
|
%=========================================================================================================
|
|
|
|
\section{Random Model: the Barabasi-Albert model}
|
|
|
|
An important step in network analysis is the comparison with a syntethic network, to study the influences of phenomena such as the \textit{small world phenomenon}. For this purpose we chose the \textbf{Barabasi-Albert model} since its preferential attachment algorithm is best suited to generate \textit{scale-free} networks.
|
|
|
|
The network is created using the same number of nodes of our real-world network, with a degree of \textbf{3} for each new node inserted. We omitthe analysis of the GCC since the artificial network contains only one connected component, which covers the whole graph.
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=7cm]{plots/barabasi.png}
|
|
\caption{Synthetic Barabasi-Albert network}
|
|
\label{rete}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
%=========================================================================================================
|
|
\subsection{Network structure} \label{ssec:structure}
|
|
|
|
\begin{itemize}
|
|
\item Network type: undirected graph
|
|
\item Number of nodes: $9879$
|
|
\item Number of edges: $19754$
|
|
\item Number of \textbf{possible} edges (of the corresponding complete network): $48792381$
|
|
\item Density (number of edges / number of possible edges): 0.04\%
|
|
\end{itemize}
|
|
|
|
Even The values for edge count and density are lower to the ones of the real world network, the difference is minimal.
|
|
|
|
\subsection{Graph Analysis} \label{ssec:analysis}
|
|
|
|
|
|
%=========================================================================================================
|
|
\subsubsection{Distances} \label{ssec:analysis-distances}
|
|
|
|
We computed the average distance between nodes in our model and show the distance distribution plot.
|
|
|
|
\begin{itemize}
|
|
\item Average distance: 5.07
|
|
\end{itemize}
|
|
|
|
Distances in a Barabasi-Albert model follow a logarithmic increase with respect to the number of nodes in the network. This explains why, on average, distances for our artificial model are lower than the ones in the collaboration network.
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=8cm]{plots/B_DistDistributionGlobal.png}
|
|
\caption{Distance distribution}
|
|
\label{distDistrTotal}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
%=========================================================================================================
|
|
\subsubsection{Degree} \label{ssec:analysis-degree}
|
|
|
|
\begin{itemize}
|
|
\item Average degree: 4.00
|
|
\item Variance: 36.17
|
|
\item Standard deviation: 6.01
|
|
\end{itemize}
|
|
|
|
The values are very similar to the ones in the real network, since both graphs follow a \textit{power law} and thus belong to the class of \textit{scale-free} networks. This is particularly visible in the logarithmic plot~\ref{degreeDistrLog}.
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=8cm]{plots/B_LinScaleDegreeDistr.png}
|
|
\caption{Linear degree distribution}
|
|
\label{degreeDistrLinear}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=8cm]{plots/B_LogScaleDegreeDistr.png}
|
|
\caption{Logarithmic degree distribution}
|
|
\label{degreeDistrLog}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
%=========================================================================================================
|
|
\subsubsection{Clustering Coefficient} \label{ssec:analysis-clustering-coefficient}
|
|
|
|
A sensible difference with respect to the real collaboration network is given by the clustering coefficient property. Using the preferential attachment policy, each new node in the synthetic network will be more inclined to connect with a high-degree node with respect to a low-degree node. This generates very few closed triangles and thus a low clustering coefficient value (very close to 0). On the other hand, the real-world network shows higher values for both transitivity and average clustering coefficient: it is because many high-degree nodes (authors) tend to connect with low degree ones, givin them the opportunity to collaborate in a project.
|
|
|
|
\begin{itemize}
|
|
\item Average Clustering coefficient: 0.00
|
|
\item Transitivity coefficient: 0.16\%
|
|
\end{itemize}
|
|
|
|
%=========================================================================================================
|
|
\subsubsection{Degree correlation}
|
|
|
|
Plotting degree assortativity on K-nearest-neighbours we discover a \textbf{disassortative} tendency of our network in nodes with degree smaller than $10$. Higher degree nodes mostly display a \textbf{neutral} behavior. This result is diverging from the assortative tendency of the real world network, in which the \textit{hub-to-hub} connections are more easily created.
|
|
This difference is mostly caused by the step-based generative process of the Barabasi-Albert model.
|
|
|
|
\begin{figure}[H]
|
|
\begin{center}
|
|
\includegraphics[height=8cm]{plots/B_Assortativity.png}
|
|
\caption{Assortativity}
|
|
\label{assortativity}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
\end{document}
|