authorNiklas Halle <niklas@niklashalle.net>2021-03-04 16:50:43 +0100
committerNiklas Halle <niklas@niklashalle.net>2021-03-04 16:50:43 +0100
commit0a34cf266e6d80fc0555debd1270945420b9df87 (patch)
treecd99336d992b13fd9dc59dd0f1b41ac36fc9d055
parent275fa376ad4e89775bbb68a5dd9ef9af7a807fbd (diff)
proposal update, still TODO: time estimates
-rw-r--r--  code/python/generate_graphs.py         |   2
-rw-r--r--  latex/proposal/Proposal.pdf            | bin 212695 -> 200424 bytes
-rw-r--r--  latex/proposal/Proposal.tex            |  38
-rw-r--r--  latex/thesis/Thesis.pdf                | bin 259708 -> 258585 bytes
-rw-r--r--  latex/thesis/Thesis.tex                |   3
-rw-r--r--  latex/thesis/common_thesis_header.tex  |   1
6 files changed, 12 insertions, 32 deletions
diff --git a/code/python/generate_graphs.py b/code/python/generate_graphs.py
index d603ec5..6b9d7f2 100644
--- a/code/python/generate_graphs.py
+++ b/code/python/generate_graphs.py
@@ -42,7 +42,7 @@ for i in range(0, number_of_graphs):
# run embedding, flatten features to simple list
embedding = nk.embedding.Node2Vec(synthetic_g) # <-- missing parameters? ask Ahrens
- embedding.run() # <-- segfaults after first call, why?
+ embedding.run()
features = embedding.getFeatures()
flat_features = sum(features, [])
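The changed script flattens the per-node feature vectors with `sum(features, [])`. As a minimal, self-contained illustration of that flattening idiom (the feature values below are made up; NetworKit itself is not needed here):

```python
# Flattening a list of per-node feature vectors into one flat list,
# mirroring `flat_features = sum(features, [])` in generate_graphs.py.
# The feature values are hypothetical placeholders for an embedding.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one vector per node

# sum(..., []) concatenates the inner lists via repeated list addition
flat_features = sum(features, [])
print(flat_features)  # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# Note: sum() over lists is quadratic in the total length; for large
# graphs a comprehension is the more scalable equivalent:
flat_fast = [x for vec in features for x in vec]
assert flat_fast == flat_features
```

The quadratic cost of `sum()` only matters for large embeddings, but a comprehension costs nothing to adopt early.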
diff --git a/latex/proposal/Proposal.pdf b/latex/proposal/Proposal.pdf
index 42e71fa..fa28b37 100644
--- a/latex/proposal/Proposal.pdf
+++ b/latex/proposal/Proposal.pdf
Binary files differ
diff --git a/latex/proposal/Proposal.tex b/latex/proposal/Proposal.tex
index 590faa7..dffc1e3 100644
--- a/latex/proposal/Proposal.tex
+++ b/latex/proposal/Proposal.tex
@@ -1,3 +1,4 @@
+% !TeX spellcheck = en_GB
\documentclass{article}
\usepackage{xargs}
@@ -10,7 +11,7 @@
\usepackage[a4paper, total={16cm, 24cm}]{geometry}
\usepackage[colorinlistoftodos,prependcaption,textsize=footnotesize]{todonotes}
\usepackage[backend=biber, %% Hilfsprogramm "biber" (statt "biblatex" oder "bibtex")
- style=authoryear, %% Zitierstil (siehe Dokumentation) %authoryear
+ style=alphabetic, %% citation style
natbib=true, %% Bereitstellen von natbib-kompatiblen Zitierkommandos
hyperref=true, %% hyperref-Paket verwenden, um Links zu erstellen
sorting=ynt,
@@ -19,39 +20,23 @@
\include{common_thesis_header}
-\title{Proposal: How does the performance of random walk based graph embedding compare to “traditional” algorithms on the examples of degree centrality, eigenvector centrality and clustering?}
+\title{\titletext\ - Proposal}
\author{Niklas Halle}
-\date{February 20, 2021}
\begin{document}
\maketitle
- \unsure{The title is probably too long / detailed?}
\section{Purpose}
- We want to compare the performance of “traditional” algorithms to find specific graph properties with the performance of a machine learning-based approach centred around graph embedding (on the grounds of random walks (\ntv)).\\Evaluation shall include speed, accuracy, stability and scalability \improvement{more/less factors to compare2?} in order to find whether certain problems might benefit from such a machine learning based approach. \improvement{Does that cover it?}
- \inlineToDo{Maybe mention \nk\ somehow?}
+ The thesis aims to find out whether “traditional”, computationally intensive algorithmic tasks on graphs can be solved better by an approach that uses a graph representation in vector space (an “embedding” via \ntv) as input to a deep neural network. “Better” here means either faster (in application) or of higher quality (or both). To find out whether improvements can be made, especially on difficult problems, we want to look at tasks for which only non-linear exact solutions are known so far. Among the centrality measures, betweenness centrality, closeness centrality and electrical closeness centrality are therefore suitable candidates. We may also be able to look at NP-hard optimisation problems, such as group centrality for different measures. In addition to centrality measures, we will look at community detection, which is already a well-studied problem in the machine learning field. Nevertheless, it is still interesting to see how a machine learning-based approach fares against, for example, PLM\footnote{parallel Louvain method}.
\section{Justification}
- While some research into embedding based approaches has already been conducted, there is little direct comparison within the same framework, especially one as widely used as \nk.\\
- Finding whether the interface and data provided by \nk is useful for certain problems and may even be competitive with “traditional” algorithms, can help further research in embedding based approaches and their application.
- \inlineToDo{TODO: extend?}
+ While some research into embedding-based approaches has already been conducted, there is little direct comparison within the same framework, especially one as widely used as \nk. Finding out whether the interface and data provided by \nk\ are useful for certain problems, and may even be competitive with “traditional” algorithms, can help further research on embedding-based approaches and their application. Furthermore, depending on our findings, we may be able to determine how useful a machine learning-based approach is compared to “traditional” algorithms, e.g. whether some heuristics might eventually become obsolete if a machine learning-based approach is accurate and/or fast enough.
\section{Literature review}
- There is a few articles and papers going into graph embedding and its application. A good example is \cite{GOYAL201878}, which is a survey on a collection of different graph embedding techniques, evaluating their applications and performances. It also includes the random walked based \ntv, which will be the base in this thesis.\\
- They find that “Choosing the right balance enables \ntv\ to preserve community structure as well as structural equivalence between nodes” but also that “Embeddings learned by \ntv\ have low reconstruction precision,” “\ntv\ outperforms other methods on the task of node classification,” and “\ntv\ preserves homophily as well as structural equivalence between nodes [which] suggest this can be useful in node classification”. Therefore comparing it with “traditional” algorithms for clustering in \nk\ (which in essence is node classification) is evident.\\
- Another interesting study on embedding and clustering is \cite{rozemberczki2019gemsec}, which is quite a bit different though, as it does the embedding and the clustering step at once, not in sequence based on the embedding.\\
- On the other hand, there is few data on the effectiveness of \ntv\ for approximating different centrality scores. Therefore we will also evaluate two of theses, to see if approximation of specific numeric properties of graphs might also be feasible using this approach in \nk.
+ Several articles and papers deal with graph embedding and its applications. A good example is \cite{GOYAL201878}, a survey of different graph embedding techniques that evaluates their applications and performance. It also covers the random-walk-based \ntv, which will be the basis of this thesis. The authors find that “Choosing the right balance enables \ntv\ to preserve community structure as well as structural equivalence between nodes”, but also that “Embeddings learned by \ntv\ have low reconstruction precision,” “\ntv\ outperforms other methods on the task of node classification,” and “\ntv\ preserves homophily as well as structural equivalence between nodes [which] suggest this can be useful in node classification”. Comparing it with “traditional” clustering algorithms in \nk\ (clustering being, in essence, node classification) therefore suggests itself. Another interesting study on embedding and clustering is \cite{rozemberczki2019gemsec}, which differs considerably, though, as it performs the embedding and the clustering step at once rather than in sequence on top of the embedding. On the other hand, there is little data on the effectiveness of \ntv\ for approximating different centrality scores. Therefore, we will also evaluate two of these to see whether approximating specific numeric properties of graphs might also be feasible using this approach in \nk.
\section{Method}
- The broad idea is to find a good set of synthetic as well as real world graphs, for which cluster can either be retrieved well or are already known. For the same graphs we want to have a small set of properties besides clustering which can be calculated exactly and have not been (a major) part of previous studies, in order to see how an \ntv-embedding + machine learning approach might perform on those.\\
- we will then familiarize myself with the machine learning framework \tf, which we intend to use for the design, training and evaluation of the machine learning based part. Planned is one network for each of the properties, but if the time is enough we could also look into trying to create one which might try to do multiple at once (which obviously could heavily boost performance \unsure{Good idea?}).\\
- Afterwards, results of the machine learning part will be compared with those of “traditional” algorithms already provided by \nk.
- \inlineToDo{Describe your proposed research methodology: Qualitative or quantitative etc\\
- Describe your time frame\\
- Describe how you intend to do this in the time available\\
- Describe your resources\\
- Deal with any ethical or access issues\\
- Describe how you intend to do your research with the available resources, what methods you intend to use.}
+ The rough idea is to find a good set of synthetic and real-world graphs for which clusters either can be retrieved well or are already known. For the same graphs, we want a small set of features besides clustering that can be computed exactly and have not been (a major) part of previous studies, to see how an \ntv-embedding + machine learning approach performs on them. We will then use the machine learning framework \tf\unsure{or scikit-learn, to be evaluated} to design, train and evaluate the machine learning-based part. One deep neural network per compared feature is planned. The results of the machine learning part will then be compared with those of the “traditional” algorithms already provided by \nk. For the community detection comparison, we plan to use an LFR\footnote{Lancichinetti-Fortunato-Radicchi} benchmark.
\subsection{Time frame}
\inlineToDo{Estimate time}
\begin{enumerate}
@@ -71,7 +56,7 @@
\item Separate data into training and test data sets
\item Designing the network
\begin{enumerate}
- \item Embed the graphs, determine the dimensions (in order to determine the size (of the first layer) of the neural network [hyperparameter, is then fixed]). \info{Is this fixed in size independent of graph size?}
+ \item Embed the graphs and determine the embedding dimensions (in order to fix the size of the first layer of the neural network; this hyperparameter is then fixed). \info{Is this fixed in size, independent of graph size?}
\item Network design: layer-, node count
\end{enumerate}
\item Training
@@ -85,15 +70,8 @@
\item Evaluate comparison
\item Outlook
\end{enumerate}
-
- \section{Dissemination}
- \improvement{Not sure about this section, do we need it?}
- \inlineToDo{Describe how the findings will be used\\
- Evaluate this use\\
- Describe how the research findings will be disseminated}
\newpage
\nocite{*}
\printbibliography
- \inlineToDo{Which citation style \link{https://www.overleaf.com/learn/latex/Biblatex\_bibliography\_styles} is desired?}
\end{document} \ No newline at end of file
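The method section above plans to separate the graph data into training and test sets. A minimal sketch of such a split using only the standard library (the graph identifiers, the 80/20 ratio and the seed are assumptions for illustration, not taken from the proposal):

```python
import random

def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle items reproducibly and split off a test set."""
    rng = random.Random(seed)      # fixed seed for a reproducible split
    shuffled = items[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# hypothetical identifiers standing in for embedded synthetic graphs
graph_ids = [f"synthetic_{i}" for i in range(10)]
train, test = train_test_split(graph_ids)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the split stable across runs, so later accuracy comparisons between the learned models and the “traditional” algorithms are made on identical test graphs.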
diff --git a/latex/thesis/Thesis.pdf b/latex/thesis/Thesis.pdf
index 37b8908..9fa6c00 100644
--- a/latex/thesis/Thesis.pdf
+++ b/latex/thesis/Thesis.pdf
Binary files differ
diff --git a/latex/thesis/Thesis.tex b/latex/thesis/Thesis.tex
index 90b53c1..e0e9d00 100644
--- a/latex/thesis/Thesis.tex
+++ b/latex/thesis/Thesis.tex
@@ -1,3 +1,4 @@
+% !TeX spellcheck = en_GB
\documentclass[
a4paper,
pagesize,
@@ -38,7 +39,7 @@
\begin{document}
% Beispielhafte Nutzung der Vorlage für die Titelseite (bitte anpassen):
\input{Institutsvorlage}
- \titel{How does the performance of random walk based graph embedding compare to “traditional” algorithms on the examples of degree centrality, eigenvector centrality and clustering?} % Titel der Arbeit
+ \titel{\titletext} % Titel der Arbeit
\typ{Bachelor Thesis} % Typ der Arbeit: Diplomarbeit, Masterarbeit, Bachelorarbeit
\grad{Bachelor of Science (B. Sc.)} % erreichter Akademischer Grad
% z.B.: Master of Science (M. Sc.), Master of Education (M. Ed.), Bachelor of Science (B. Sc.), Bachelor of Arts (B. A.), Diplominformatikerin
diff --git a/latex/thesis/common_thesis_header.tex b/latex/thesis/common_thesis_header.tex
index 0ee736d..f05a5d8 100644
--- a/latex/thesis/common_thesis_header.tex
+++ b/latex/thesis/common_thesis_header.tex
@@ -27,6 +27,7 @@
\newcommand{\link}[1]{\href{#1}{#1}}
% shorthands
+\newcommand{\titletext}{Comparison of graph representations and traditional graph algorithms}
\newcommand{\nk}{\textit{NetworKit}}
\newcommand{\tf}{\textit{TensorFlow}}
\newcommand{\ntv}{\textit{node2vec}} \ No newline at end of file