author     Niklas Halle <niklas@niklashalle.net>  2021-02-21 15:58:35 +0100
committer  Niklas Halle <niklas@niklashalle.net>  2021-02-21 15:58:35 +0100
commit     f10119b9443f821a8202cd5235f939d4a1967e04 (patch)
tree       c25a9cc53c091b2ae67d162e21c1aa72211c67da
parent     962e5e78f7315303355e356675f2ad9140d62e24 (diff)
first real draft for the proposal
-rw-r--r--  latex/proposal/Proposal.pdf             bin  209189 -> 211300 bytes
-rw-r--r--  latex/proposal/Proposal.tex             84
-rw-r--r--  latex/thesis/Thesis.pdf                 bin  258442 -> 260086 bytes
-rw-r--r--  latex/thesis/Thesis.tex                 2
-rw-r--r--  latex/thesis/chapter/notes.tex          28
-rw-r--r--  latex/thesis/common_thesis_header.tex   25
6 files changed, 84 insertions, 55 deletions
diff --git a/latex/proposal/Proposal.pdf b/latex/proposal/Proposal.pdf
index 2adc1ba..e1f84ec 100644
--- a/latex/proposal/Proposal.pdf
+++ b/latex/proposal/Proposal.pdf
Binary files differ
diff --git a/latex/proposal/Proposal.tex b/latex/proposal/Proposal.tex
index 1be1891..6dd80c6 100644
--- a/latex/proposal/Proposal.tex
+++ b/latex/proposal/Proposal.tex
@@ -19,7 +19,7 @@
\include{common_thesis_header}
-\title{Proposal: How does the performance of random walk based graph embedding compare to "traditional" algorithms on the examples of degree centrality, eigenvector centrality and clustering?}
+\title{Proposal: How does the performance of random walk based graph embedding compare to “traditional” algorithms on the examples of degree centrality, eigenvector centrality and clustering?}
\author{Niklas Halle}
\date{February 20, 2021}
@@ -27,75 +27,69 @@
\maketitle
\section{Purpose}
- We want to compare the performance of "`traditional"' algorithms to find certain graph properties with the performance of an machine learning based approach centered around graph embedding (on the grounds of random walks (\ntv)).\\
- Evaluation shall include speed, accuracy, stability and scalability \improvement{more/less?} in order to find whether certain problems might benefit from such an machine learning based approach. \improvement{Does that cover it?}
- \inlineToDo{Maybe mention \nk somehow?}
+ We want to compare the performance of “traditional” algorithms for finding specific graph properties with the performance of a machine learning-based approach centred around graph embedding (based on random walks (\ntv)).\\Evaluation shall include speed, accuracy, stability and scalability \improvement{more/less factors to compare?} in order to find whether certain problems might benefit from such a machine learning-based approach. \improvement{Does that cover it?}
+ \inlineToDo{Maybe mention \nk\ somehow?}
\section{Justification}
- While some research into embedding based approaches has already been conducted (e.g. see \cite{GOYAL201878}), none to few have done a direct comparison within the same framework, especially one as widely used as \nk.\\
- Finding whether the interface provided and data by \nk is useful for certain problems and maybe even competitive with "`traditional"' algorithms can help further research in embedding based approaches and their application.
+ While some research into embedding-based approaches has already been conducted, there is little direct comparison within the same framework, especially one as widely used as \nk.\\
+ Finding out whether the interface and data provided by \nk\ make such an approach useful for certain problems, and maybe even competitive with “traditional” algorithms, can help further research into embedding-based approaches and their application.
\inlineToDo{TODO: extend?}
\section{Literature review}
- Report any previous research\\
- Give examples of previous research\\
- Evaluate any previous research\\
- Identify any gaps\\
- Describe how you intend to fill the gaps\\
- \\
- \cite{GOYAL201878} is a survey on a collection of different graph embedding techniques, evaluating their applications and performances. Among others, it also includes the random walked based \ntv, which will be the base in this thesis as well.\\
- They find that "`Choosing the right balance enables \ntv to preserve community
-structure as well as structural equivalence between nodes"' but also that "`Embeddings learned by \ntv have low
- reconstruction precision"'. "`\ntv outperforms other methods on the task of node classification"'
+ There are a few articles and papers on graph embedding and its applications. A good example is \cite{GOYAL201878}, a survey of a collection of different graph embedding techniques that evaluates their applications and performance. It also includes the random walk-based \ntv, which will be the basis of this thesis as well.\\
+ They find that “Choosing the right balance enables \ntv\ to preserve community structure as well as structural equivalence between nodes” but also that “Embeddings learned by \ntv\ have low reconstruction precision,” “\ntv\ outperforms other methods on the task of node classification,” and “\ntv\ preserves homophily as well as structural equivalence between nodes [which] suggest this can be useful in node classification”. Therefore, comparing it with “traditional” algorithms for clustering in \nk\ (which in essence is node classification) suggests itself.\\
+ Another interesting study on embedding and clustering is \cite{rozemberczki2019gemsec}, which differs in that it performs the embedding and the clustering step at once rather than clustering in sequence on top of the embedding.\\
+ On the other hand, there is little data on the effectiveness of \ntv\ for approximating different centrality scores. Therefore we will also evaluate two of these, to see whether approximating specific numeric graph properties might also be feasible using this approach in \nk.
\section{Method}
- Describe your proposed research methodology: Qualitative or quantitative etc\\
- Describe your time frame\\
- Describe how you intend to do this in the time available\\
- Describe your resources\\
- Deal with any ethical or access issues\\
- Describe how you intend to do your research with the available resources, what methods you intend to use.\\
+ The broad idea is to find a good set of synthetic as well as real-world graphs for which clusters can either be recovered well or are already known. For the same graphs, we want a small set of properties besides clustering that can be calculated exactly and have not been a (major) part of previous studies, in order to see how an \ntv-embedding + machine learning approach might perform on those.\\
+ I will then familiarise myself with the machine learning framework \tf, which I intend to use for the design, training and evaluation of the machine learning-based part. The plan is one network for each of the properties, but if time allows we could also look into creating a single network that handles several of them at once (which obviously could heavily boost performance \unsure{Good idea?}).\\
+ Afterwards, the results of the machine learning part will be compared with those of the “traditional” algorithms already provided by \nk.
+ \inlineToDo{Describe your proposed research methodology: Qualitative or quantitative etc\\
+ Describe your time frame\\
+ Describe how you intend to do this in the time available\\
+ Describe your resources\\
+ Deal with any ethical or access issues\\
+ Describe how you intend to do your research with the available resources, what methods you intend to use.}
\subsection{Time frame}
+ \inlineToDo{Estimate time}
\begin{enumerate}
- \item Proposal schreiben
+ \item Collect graphs
\begin{enumerate}
- \item Literatur sammeln
+ \item By means of (synthetic) graph generation
+ \item By means of selection of natural graphs
\end{enumerate}
- \item Graphen sammeln
- \begin{enumerate}
- \item Über Generierung von synthetischen Graphen
- \item Über Auswahl natürlicher Graphen
- \end{enumerate}
- \item Berechnung der gewählten Centrality-Werte --> Test-/Trainingsdaten
+ \item Calculation of the selected centrality values and cluster sets $\rightarrow$ test and training data
\begin{enumerate}
- \item Datenformat entwickeln (Graph+Erwartungswert, Menge von solchen Paaren)
- \item Algorithmus basteln, der Graph annimmt, und Graph+Erwartungswert ausgibt / zu einer solchen Menge hinzufügt
+ \item Develop data format (graph+expected value, set of such pairs)
+ \item Build an algorithm that takes a graph and outputs graph+expected value / adds to such a set.
\end{enumerate}
- \item Training eines Neuralen Netzes
+ \item Familiarise myself with \tf
+ \item Train neural network
\begin{enumerate}
- \item Teilen in Test- und Trainingsdaten
- \item Entwerfen des Netzes
+ \item Separate data into training and test data sets
+ \item Design the network
\begin{enumerate}
- \item Graphen Embedden, die Dimensionen ermitteln (um damit die Größe (des ersten Layers) des Neuralen Netzes zu bestimmen [Hyperparameter, ist dann fix])
- \item Netzentwurf: Layer-, Node-Count
+ \item Embed the graphs and determine the embedding dimension (this fixes the size of the first layer of the neural network [a hyperparameter, fixed from then on]). \info{Is this fixed in size independent of graph size?}
+ \item Network design: layer and node count
\end{enumerate}
\item Training
\begin{enumerate}
\item ???
\end{enumerate}
\end{enumerate}
- \item Testen und Auswertung
- \begin{enumerate}
- \item ??? \improvement{Profit?}
- \end{enumerate}
- \item Ausblick
+ \item Testing and evaluation
+ \item ...\improvement{Probably need some more steps here...}
+ \item Compare results
+ \item Evaluate comparison
+ \item Outlook
\end{enumerate}
\section{Dissemination}
- Describe how the findings will be used\\
+ \improvement{Not sure about this section, do we need it?}
+ \inlineToDo{Describe how the findings will be used\\
Evaluate this use\\
- Describe how the research findings will be disseminated\\
- \inlineToDo{Not sure about this section, do we need it?}
+ Describe how the research findings will be disseminated}
\newpage
\nocite{*}
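
[Editor's sketch] The Method section and time-frame list in the Proposal.tex diff above describe a concrete pipeline: compute exact centralities and clusters with NetworKit as ground truth, embed the graphs with node2vec, and train a TensorFlow network on (embedding, expected value) pairs. As a rough, non-authoritative illustration of that plan in Python: the nk.embedding.Node2Vec call and its argument order, the graph parameters, the 80/20 split and the layer sizes are all assumptions made for illustration, not choices fixed by the proposal.

    # Minimal sketch of the proposed pipeline (assumptions: NetworKit exposes a
    # node2vec embedding as nk.embedding.Node2Vec, TensorFlow 2.x / Keras is used,
    # and all numeric parameters below are placeholders).
    import networkit as nk
    import numpy as np
    import tensorflow as tf

    # 1. Synthetic graph standing in for the collected graph set.
    G = nk.generators.ErdosRenyiGenerator(1000, 0.01).generate()

    # 2. Ground truth via a "traditional" NetworKit algorithm; degree centrality is
    #    used as the example target, eigenvector centrality would work the same way.
    target = np.array(nk.centrality.DegreeCentrality(G).run().scores())

    # 3. node2vec embedding of every node (argument order is an assumption:
    #    p, q, walk length, walks per node, embedding dimension).
    emb = nk.embedding.Node2Vec(G, 1.0, 0.5, 80, 10, 64)
    features = np.array(emb.run().getFeatures())

    # 4. (embedding vector, expected value) pairs, split into training and test data.
    split = int(0.8 * len(target))
    X_train, X_test = features[:split], features[split:]
    y_train, y_test = target[:split], target[split:]

    # 5. Small regression network; layer and node counts are placeholders to be
    #    tuned later as described in the time frame.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(features.shape[1],)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)

    # 6. Compare the learned approximation against the exact held-out values.
    print("test MSE:", model.evaluate(X_test, y_test, verbose=0))
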
diff --git a/latex/thesis/Thesis.pdf b/latex/thesis/Thesis.pdf
index 938c73a..4d54426 100644
--- a/latex/thesis/Thesis.pdf
+++ b/latex/thesis/Thesis.pdf
Binary files differ
diff --git a/latex/thesis/Thesis.tex b/latex/thesis/Thesis.tex
index b342e2a..3991381 100644
--- a/latex/thesis/Thesis.tex
+++ b/latex/thesis/Thesis.tex
@@ -37,7 +37,7 @@
\begin{document}
% Beispielhafte Nutzung der Vorlage für die Titelseite (bitte anpassen):
\input{Institutsvorlage}
- \titel{How does the performance of random walk based graph embedding compare to "traditional" algorithms on the examples of degree centrality, eigenvector centrality and clustering?} % Titel der Arbeit
+ \titel{How does the performance of random walk based graph embedding compare to “traditional” algorithms on the examples of degree centrality, eigenvector centrality and clustering?} % Titel der Arbeit
\typ{Bachelor Thesis} % Typ der Arbeit: Diplomarbeit, Masterarbeit, Bachelorarbeit
\grad{Bachelor of Science (B. Sc.)} % erreichter Akademischer Grad
% z.B.: Master of Science (M. Sc.), Master of Education (M. Ed.), Bachelor of Science (B. Sc.), Bachelor of Arts (B. A.), Diplominformatikerin
diff --git a/latex/thesis/chapter/notes.tex b/latex/thesis/chapter/notes.tex
index ac24c5b..48a6f0b 100644
--- a/latex/thesis/chapter/notes.tex
+++ b/latex/thesis/chapter/notes.tex
@@ -4,4 +4,30 @@ The main body consists of several chapters of background, ideas, methods, data,
The Introduction gives background knowledge that supports the reason for undertaking the research and an organisation\footnote{This is BE, deal with it} statement. It should clearly state the problem to be solved in the form of a research question or hypothesis and be clear about the need for the research and its significance. The Literature Review/Theory will set your research against a background of what is already known about the topic in question, and be clear about the gap to be filled and the significance of this. The Methodology section gives detailed information of how the information in the dissertation was obtained. It should persuade your readers that the research was done well so the results can be believed. Findings and Results give the data that has been collected, while the Discussion argues that the results lead to the clearly expressed conclusion, with any Limitations taken into account.
-The chapters are linked in order to connect the ideas. The purpose of the dissertation or thesis report must be made clear and the reader must be able to follow its development.
\ No newline at end of file
+The chapters are linked in order to connect the ideas. The purpose of the dissertation or thesis report must be made clear and the reader must be able to follow its development.
+
+\vline
+
+\ntv. \ntv\ performs biased random walks on the graph and embeds nodes commonly
+appearing together in them close in the embedding space. Of the various
+hyperparameters of the method (which include walk length, context size and bias
+weights), we analyze the effect of bias weights on performance and adopt
+commonly-used values for the remaining hyperparameter [29], namely, context size
+of 10 and walk length of 80. node2vec has two bias weights: (a) inout parameter
+(q), which controls the likelihood of random walk to go further away from the
+incoming node (higher values favor closer nodes), and (b) return parameter (p),
+which weighs the return probability (lower values favor return). Lower values of
+q would help capture longer distances between nodes and aims towards preserving
+structural equivalence. Fig. 13 illustrates the effect on PPI and HEPTH. In node
+classification (Fig. 13c) we observe that low q values help achieve higher
+accuracy suggesting that capturing structural equivalence is required to
+accurately embed the structure of the graph. On the contrary, high q values
+favor link prediction (Fig. 13b) following intuition that capturing community
+structure is crucial for the task. We make similar observations for SBM in
+Fig. 14 for the task of link prediction. MAP increases with increasing q until
+it reaches an optimal. We also note that the optimal values of q in SBM are much
+higher as the graph has strong community structure.
\ No newline at end of file
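
[Editor's sketch] The excerpt above hinges on node2vec's two bias weights, the return parameter p and the in-out parameter q. As a small, hedged reminder for these notes, the sketch below spells out the second-order bias alpha_pq(t, x) from the node2vec paper; the toy adjacency list and parameter values are made up for illustration.

    # Unnormalized node2vec bias alpha_pq(t, x): the walk moved t -> v and now
    # weighs each neighbour x of v (Grover & Leskovec's scheme).
    def node2vec_bias(adj, t, v, p, q):
        weights = {}
        for x in adj[v]:
            if x == t:            # distance 0: return to the previous node
                weights[x] = 1.0 / p
            elif x in adj[t]:     # distance 1: neighbour of both t and v
                weights[x] = 1.0
            else:                 # distance 2: move further away from t
                weights[x] = 1.0 / q
        return weights

    # Toy graph: triangle 0-1-2 with a pendant node 3 attached to 2.
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}

    # Walk went 0 -> 2: a low p favours returning to 0, a low q favours moving on to 3.
    print(node2vec_bias(adj, t=0, v=2, p=0.25, q=4.0))
    # -> {0: 4.0, 1: 1.0, 3: 0.25} (key order may vary)
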
diff --git a/latex/thesis/common_thesis_header.tex b/latex/thesis/common_thesis_header.tex
index 36f9eb4..0ee736d 100644
--- a/latex/thesis/common_thesis_header.tex
+++ b/latex/thesis/common_thesis_header.tex
@@ -8,16 +8,25 @@
\displaywidowpenalty = 10000
% notes
-\newcommandx{\inlineToDo}[2][1=]{\todo[inline,linecolor=blue,backgroundcolor=blue!25,bordercolor=blue,#1]{#2}}
-\newcommandx{\unsure}[2][1=]{\todo[linecolor=red,backgroundcolor=red!25,bordercolor=red,#1]{#2}}
-\newcommandx{\change}[2][1=]{\todo[linecolor=blue,backgroundcolor=blue!25,bordercolor=blue,#1]{#2}}
-\newcommandx{\info}[2][1=]{\todo[linecolor=OliveGreen,backgroundcolor=OliveGreen!25,bordercolor=OliveGreen,#1]{#2}}
-\newcommandx{\improvement}[2][1=]{\todo[linecolor=Plum,backgroundcolor=Plum!25,bordercolor=Plum,#1]{#2}}
-\newcommandx{\thiswillnotshow}[2][1=]{\todo[disable,#1]{#2}}
+\newcommandx{\inlineToDoT}[2][1=]{\todo[inline,linecolor=blue,backgroundcolor=blue!25,bordercolor=blue,#1]{#2}}
+\newcommandx{\unsureT}[2][1=]{\todo[linecolor=red,backgroundcolor=red!25,bordercolor=red,#1]{#2}}
+\newcommandx{\changeT}[2][1=]{\todo[linecolor=blue,backgroundcolor=blue!25,bordercolor=blue,#1]{#2}}
+\newcommandx{\infoT}[2][1=]{\todo[linecolor=OliveGreen,backgroundcolor=OliveGreen!25,bordercolor=OliveGreen,#1]{#2}}
+\newcommandx{\improvementT}[2][1=]{\todo[linecolor=Plum,backgroundcolor=Plum!25,bordercolor=Plum,#1]{#2}}
+\newcommandx{\thiswillnotshowT}[2][1=]{\todo[disable,#1]{#2}}
+
+% extra, so TexStudio recognizes them
+\newcommand{\inlineToDo}[1]{\inlineToDoT{#1}}
+\newcommand{\unsure}[1]{\unsureT{#1}}
+\newcommand{\change}[1]{\changeT{#1}}
+\newcommand{\info}[1]{\infoT{#1}}
+\newcommand{\improvement}[1]{\improvementT{#1}}
+\newcommand{\thiswillnotshow}[1]{\thiswillnotshowT{#1}}
% custom commands
\newcommand{\link}[1]{\href{#1}{#1}}
% shorthands
-\newcommand{\nk}{\textit{NetworKit}\ }
-\newcommand{\ntv}{\textit{node2vec}\ }
\ No newline at end of file
+\newcommand{\nk}{\textit{NetworKit}}
+\newcommand{\tf}{\textit{TensorFlow}}
+\newcommand{\ntv}{\textit{node2vec}}
\ No newline at end of file