**diff options**

-rw-r--r-- | Antrag_Zulassung_Abschlussarbeit.pdf | bin | 0 -> 706821 bytes | |||

-rw-r--r-- | Merkblatt Abschlussarbeit Info.pdf | bin | 0 -> 54963 bytes | |||

-rw-r--r-- | Zeitplan.entwurf | 21 | ||||

-rw-r--r-- | code/python/generate_graphs.py | 1 | ||||

-rw-r--r-- | latex/proposal/Proposal.pdf | bin | 200424 -> 223610 bytes | |||

-rw-r--r-- | latex/proposal/Proposal.tex | 55 | ||||

-rw-r--r-- | latex/proposal/bibliography/bibliography.bib | 124 | ||||

-rw-r--r-- | latex/thesis/Institutsvorlage.tex | 2 | ||||

-rw-r--r-- | latex/thesis/Thesis.pdf | bin | 258585 -> 260062 bytes | |||

-rw-r--r-- | latex/thesis/common_thesis_header.tex | 11 |

10 files changed, 127 insertions, 87 deletions

diff --git a/Antrag_Zulassung_Abschlussarbeit.pdf b/Antrag_Zulassung_Abschlussarbeit.pdf Binary files differnew file mode 100644 index 0000000..531c85c --- /dev/null +++ b/Antrag_Zulassung_Abschlussarbeit.pdf diff --git a/Merkblatt Abschlussarbeit Info.pdf b/Merkblatt Abschlussarbeit Info.pdf Binary files differnew file mode 100644 index 0000000..6189a5d --- /dev/null +++ b/Merkblatt Abschlussarbeit Info.pdf diff --git a/Zeitplan.entwurf b/Zeitplan.entwurf new file mode 100644 index 0000000..7df8b29 --- /dev/null +++ b/Zeitplan.entwurf @@ -0,0 +1,21 @@ +Vier Monate + +Woche 1: +Woche 2: +Woche 3: +Woche 4: +Woche 5: +Woche 6: +Woche 7: +Woche 8: +Woche 9: +Woche 10: +Woche 11: +Woche 12: +Woche 13: Aufschreiben +Woche 14: Aufschreiben +Woche 15: Aufschreiben +Woche 16: Finalisierungen (Puffer) + + +electrical closeness centrality, a. k. a. current-flow closeness or information centrality
\ No newline at end of file diff --git a/code/python/generate_graphs.py b/code/python/generate_graphs.py index 6b9d7f2..f175518 100644 --- a/code/python/generate_graphs.py +++ b/code/python/generate_graphs.py @@ -45,6 +45,7 @@ for i in range(0, number_of_graphs): embedding.run() features = embedding.getFeatures() flat_features = sum(features, []) + # --> len(flat_features) = 25600 # store features with open(target_dir + features_file, "w", encoding='utf-8') as fp: diff --git a/latex/proposal/Proposal.pdf b/latex/proposal/Proposal.pdf Binary files differindex fa28b37..bc17b88 100644 --- a/latex/proposal/Proposal.pdf +++ b/latex/proposal/Proposal.pdf diff --git a/latex/proposal/Proposal.tex b/latex/proposal/Proposal.tex index dffc1e3..c0b75cc 100644 --- a/latex/proposal/Proposal.tex +++ b/latex/proposal/Proposal.tex @@ -10,17 +10,18 @@ \usepackage[dvipsnames]{xcolor} \usepackage[a4paper, total={16cm, 24cm}]{geometry} \usepackage[colorinlistoftodos,prependcaption,textsize=footnotesize]{todonotes} +\usepackage{multirow} \usepackage[backend=biber, %% Hilfsprogramm "biber" (statt "biblatex" oder "bibtex") - style=alphabetic, %% Zitierstil + style=numeric, %% Zitierstil (alphabetic) natbib=true, %% Bereitstellen von natbib-kompatiblen Zitierkommandos hyperref=true, %% hyperref-Paket verwenden, um Links zu erstellen - sorting=ynt, + sorting=none, %% YearNameTitle, none --> citation order (best with style numeric) ]{biblatex} \addbibresource{bibliography/bibliography.bib} \include{common_thesis_header} -\title{\titletext\ - Proposal} +\title{\titletext\\Proposal} \author{Niklas Halle} \begin{document} @@ -36,40 +37,20 @@ Several articles and papers are dealing with graph embedding and its application. A good example is \cite{GOYAL201878}, a survey on a collection of different graph embedding techniques, evaluating their applications and performances. It also includes the random walked based \ntv, which will be the base in this thesis. They find that “Choosing the right balance enables \ntv\ to preserve community structure as well as structural equivalence between nodes” but also that “Embeddings learned by \ntv\ have low reconstruction precision,” “\ntv\ outperforms other methods on the task of node classification,” and “\ntv\ preserves homophily as well as structural equivalence between nodes [which] suggest this can be useful in node classification”. Therefore comparing it with “traditional” algorithms for clustering in \nk\ (which in essence is node classification) is evident. Another interesting study on embedding and clustering is \cite{rozemberczki2019gemsec}, which is quite different though, as it does the embedding and the clustering step at once, not in sequence based on the embedding. On the other hand, there are few data on the effectiveness of \ntv\ for approximating different centrality scores. Therefore we will also evaluate two of theses to see if an approximation of specific numeric properties of graphs might also be feasible using this approach in \nk. \section{Method} - The rough idea is to find a good set of synthetic and real graphs for which clusters can either be well retrieved or are already known. For the same graphs, we want to have a small set of features besides clustering that can be accurately computed and have not been (a major) part of previous studies to see how an \ntv-embedding + machine learning approach might perform on these. We will then use the machine learning framework \tf\unsure{or scikit-learn, to be evaluated}, which we plan to use to design, train and evaluate the machine learning-based part. One deep neural network for each of the compared features is planned. The machine learning part results will then be compared with those of “traditional” algorithms already provided by \nk. For the community detection comparison, we plan to use an LFR\footnote{Lancichinetti-Fortunato-Radicchi} benchmark. - \subsection{Time frame} - \inlineToDo{Estimate time} - \begin{enumerate} - \item Collect graphs - \begin{enumerate} - \item By means of (synthetic) graph generation - \item By means of selection of natural graphs - \end{enumerate} - \item Calculation of the selected centrality values and cluster sets --> Test and training data - \begin{enumerate} - \item Develop data format (graph+expected value, set of such pairs) - \item Build an algorithm that takes a graph and outputs graph+expected value / adds to such a set. - \end{enumerate} - \item Familiarise with \tf - \item Train neural network - \begin{enumerate} - \item Separate data into training and test data sets - \item Designing the network - \begin{enumerate} - \item Embed the graphs, determine the dimensions (in order to determine the size (of the first layer) of the neural network [hyper parameter, is then fixed]). \info{Is this fixed in size independent of graph size?} - \item Network design: layer-, node count - \end{enumerate} - \item Training - \begin{enumerate} - \item ??? - \end{enumerate} - \end{enumerate} - \item Testing and evaluation - \item ...\improvement{Probably need some more steps here...} - \item Compare results - \item Evaluate comparison - \item Outlook - \end{enumerate} + The rough idea is to find a good set of synthetic and real graphs for which clusters can either be well retrieved or are already known. For the same graphs, we want to have a small set of features besides clustering that can be accurately computed and have not been (a major) part of previous studies to see how an \ntv-embedding + machine learning approach might perform on these. We will then use the machine learning framework \tf, which we plan to use to design, train and evaluate the machine learning-based part. One deep neural network for each of the compared features is planned. The machine learning part results will then be compared with those of “traditional” algorithms already provided by \nk. For the community detection comparison, we plan to use an LFR\footnote{Lancichinetti-Fortunato-Radicchi} benchmark. + + \subsection{Estimated time frame} + \begin{tabular}{c|l} + \textbf{Week} & \textbf{Goal} \\\hline + 1. \& 2. & \mlcellL{Generate (small) test graphs using LFR\footnote{Lancichinetti–Fortunato–Radicchi}-benchmark\\Program graph --> \{graph, target values\} (trainings and test data for the DNN)}\\\hline + 3. \& 4. & \mlcellL{Create and evaluate different DNNs using week ones graphs,\\and decide on one architecture (for each property)}\\\hline + 5. \& 6. & \mlcellL{Build training pipeline:\\\ graphs --(via traditional algorithms)--> test data --(training)--> trained deep neural network\\Build evaluation pipeline:\\\ graph + trained dnn --> exact/benchmark solution + dnn-solution\\\quad(including timing and difference)}\\\hline + 7. \& 8. & \mlcellL{Gather real world graphs with (good) ground truth for clustering and generate more\\(not as small) synthetic graphs\\Generate test data (i.e., calculate solutions for the centralities) and split into train- and validation\\sets}\\\hline + 9. \& 10. & Train networks, start evaluation with as soon as one is done \\\hline + 11. \& 12. & Gather and evaluate data\\\hline + 13., 14. \& 15. & Writing \\\hline + 16. & Finalisation (buffer) + \end{tabular} \newpage \nocite{*} diff --git a/latex/proposal/bibliography/bibliography.bib b/latex/proposal/bibliography/bibliography.bib index ff07091..0980dea 100644 --- a/latex/proposal/bibliography/bibliography.bib +++ b/latex/proposal/bibliography/bibliography.bib @@ -1,25 +1,32 @@ -@inproceedings{rozemberczki2019gemsec, - title={GEMSEC: Graph Embedding with Self Clustering}, - author={Rozemberczki, Benedek and Davies, Ryan and Sarkar, Rik and Sutton, Charles}, - booktitle={Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2019}, - pages={65-72}, - year={2019}, - organization={ACM} +% web source (with link) +@misc{grover2016node2vec, + title={node2vec: Scalable Feature Learning for Networks}, + author={Aditya Grover and Jure Leskovec}, + year={2016}, + eprint={1607.00653}, + archivePrefix={arXiv}, + primaryClass={cs.SI} } -@inproceedings{tang2015line,
- title={LINE: Large-scale Information Network Embedding.},
- author={Tang, Jian and Qu, Meng and Wang, Mingzhe and Zhang, Ming and Yan, Jun and Mei, Qiaozhu},
- booktitle={WWW},
- year={2015},
- organization={ACM}
+@book{DBLP:reference/snam/2018, + author = {D. A. Bader and A. Kappes and H. Meyerhenke and P. Sanders and C. Schulz and D. Wagner}, + title = {Benchmarking for Graph Clustering and Partitioning.}, + year = {2018}, + booktitle = {Encyclopedia of Social Network Analysis and Mining, 2nd Edition}, + publisher = {Springer}, + url = {https://www.springer.com/de/book/9781493971305} } -@article{clusterings, - author = {Wagner, Silke and Wagner, Dorothea}, - year = {2007}, - month = {01}, - pages = {}, - title = {Comparing Clusterings - An Overview}, - journal = {Technical Report 2006-04} +@article{GOYAL201878, + title = {Graph embedding techniques, applications, and performance: A survey}, + journal = {Knowledge-Based Systems}, + volume = {151}, + pages = {78 - 94}, + year = {2018}, + issn = {0950-7051}, + doi = {https://doi.org/10.1016/j.knosys.2018.03.022}, + url = {http://www.sciencedirect.com/science/article/pii/S0950705118301540}, + author = {Palash Goyal and Emilio Ferrara}, + keywords = {Graph embedding techniques, Graph embedding applications, Python graph embedding methods GEM library}, + abstract = {Graphs, such as social networks, word co-occurrence networks, and communication networks, occur naturally in various real-world applications. Analyzing them yields insight into the structure of society, language, and different patterns of communication. Many approaches have been proposed to perform the analysis. Recently, methods which use the representation of graph nodes in vector space have gained traction from the research community. In this survey, we provide a comprehensive and structured analysis of various graph embedding techniques proposed in the literature. We first introduce the embedding task and its challenges such as scalability, choice of dimensionality, and features to be preserved, and their possible solutions. We then present three categories of approaches based on factorization methods, random walks, and deep learning, with examples of representative algorithms in each category and analysis of their performance on various tasks. We evaluate these state-of-the-art methods on a few common datasets and compare their performance against one another. Our analysis concludes by suggesting some potential applications and future directions. We finally present the open-source Python library we developed, named GEM (Graph Embedding Methods, available at https://github.com/palash1992/GEM), which provides all presented algorithms within a unified interface to foster and facilitate research on the topic.} } @inproceedings{bcaprox, author = {Maurya, Sunil Kumar and Liu, Xin and Murata, Tsuyoshi}, @@ -38,40 +45,61 @@ location = {Beijing, China}, series = {CIKM '19} } -@article{GOYAL201878,
- title = {Graph embedding techniques, applications, and performance: A survey},
- journal = {Knowledge-Based Systems},
- volume = {151},
- pages = {78 - 94},
- year = {2018},
- issn = {0950-7051},
- doi = {https://doi.org/10.1016/j.knosys.2018.03.022},
- url = {http://www.sciencedirect.com/science/article/pii/S0950705118301540},
- author = {Palash Goyal and Emilio Ferrara},
- keywords = {Graph embedding techniques, Graph embedding applications, Python graph embedding methods GEM library},
- abstract = {Graphs, such as social networks, word co-occurrence networks, and communication networks, occur naturally in various real-world applications. Analyzing them yields insight into the structure of society, language, and different patterns of communication. Many approaches have been proposed to perform the analysis. Recently, methods which use the representation of graph nodes in vector space have gained traction from the research community. In this survey, we provide a comprehensive and structured analysis of various graph embedding techniques proposed in the literature. We first introduce the embedding task and its challenges such as scalability, choice of dimensionality, and features to be preserved, and their possible solutions. We then present three categories of approaches based on factorization methods, random walks, and deep learning, with examples of representative algorithms in each category and analysis of their performance on various tasks. We evaluate these state-of-the-art methods on a few common datasets and compare their performance against one another. Our analysis concludes by suggesting some potential applications and future directions. We finally present the open-source Python library we developed, named GEM (Graph Embedding Methods, available at https://github.com/palash1992/GEM), which provides all presented algorithms within a unified interface to foster and facilitate research on the topic.} -} -@misc{grover2016node2vec, - title={node2vec: Scalable Feature Learning for Networks}, - author={Aditya Grover and Jure Leskovec}, - year={2016}, - eprint={1607.00653}, +@misc{angriman2021approximation, + title={Approximation of the Diagonal of a Laplacian's Pseudoinverse for Complex Network Analysis}, + author={Eugenio Angriman and Maria Predari and Alexander van der Grinten and Henning Meyerhenke}, + year={2021}, + eprint={2006.13679}, archivePrefix={arXiv}, - primaryClass={cs.SI} + primaryClass={cs.DS} } -@book{DBLP:reference/snam/2018, - author = "D. A. Bader and A. Kappes and H. Meyerhenke and P. Sanders and C. Schulz and D. Wagner", - title = "Benchmarking for Graph Clustering and Partitioning.", - year = 2018, - booktitle = "Encyclopedia of Social Network Analysis and Mining, 2nd Edition", - publisher = "Springer", - url = "https://www.springer.com/de/book/9781493971305" +% web source (without link) +@inproceedings{tang2015line, + title={LINE: Large-scale Information Network Embedding.}, + author={Tang, Jian and Qu, Meng and Wang, Mingzhe and Zhang, Ming and Yan, Jun and Mei, Qiaozhu}, + booktitle={WWW}, + year={2015}, + organization={ACM} +} + +% articel sources +@inproceedings{rozemberczki2019gemsec, + title={GEMSEC: Graph Embedding with Self Clustering}, + author={Rozemberczki, Benedek and Davies, Ryan and Sarkar, Rik and Sutton, Charles}, + booktitle={Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2019}, + pages={65-72}, + year={2019}, + organization={ACM} +} + +% book sources +@article{clusterings, + author = {Wagner, Silke and Wagner, Dorothea}, + year = {2007}, + month = {01}, + pages = {}, + title = {Comparing Clusterings - An Overview}, + journal = {Technical Report 2006-04} +} +@InProceedings{10.1007/978-3-540-31856-9_44,
+ author="Brandes, Ulrik
+ and Fleischer, Daniel",
+ editor="Diekert, Volker
+ and Durand, Bruno",
+ title="Centrality Measures Based on Current Flow",
+ booktitle="STACS 2005",
+ year="2005",
+ publisher="Springer Berlin Heidelberg",
+ address="Berlin, Heidelberg",
+ pages="533--544",
+ abstract="We consider variations of two well-known centrality measures, betweenness and closeness, with a different model of information spread. Rather than along shortest paths only, it is assumed that information spreads efficiently like an electrical current. We prove that the current-flow variant of closeness centrality is identical with another known measure, information centrality, and give improved algorithms for computing both measures exactly. Since running times and space requirements are prohibitive for large networks, we also present a randomized approximation scheme for current-flow betweenness.",
+ isbn="978-3-540-31856-9"
} -% graphs +% graph sources @inproceedings{karateclub, - title = {{Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs}}, + title = {Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs}, author = {Benedek Rozemberczki and Oliver Kiss and Rik Sarkar}, year = {2020}, pages = {3125–3132}, diff --git a/latex/thesis/Institutsvorlage.tex b/latex/thesis/Institutsvorlage.tex index fe9e8c9..094d190 100644 --- a/latex/thesis/Institutsvorlage.tex +++ b/latex/thesis/Institutsvorlage.tex @@ -96,7 +96,7 @@ geboren am: & {\@titelGeburtsdatum}\\ geboren in: & \@titelGeburtsort \vspace{0.5\baselineskip}\\ - Gutachter*innen: & \@titelGutachterA \\ + Gutachter: & \@titelGutachterA \\ & \@titelGutachterB \vspace{0.5\baselineskip}\\ eingereicht am: & \@titelEinreichungsdatum \hfill \@titelVerteidigungsdatum diff --git a/latex/thesis/Thesis.pdf b/latex/thesis/Thesis.pdf Binary files differindex 9fa6c00..ac4642a 100644 --- a/latex/thesis/Thesis.pdf +++ b/latex/thesis/Thesis.pdf diff --git a/latex/thesis/common_thesis_header.tex b/latex/thesis/common_thesis_header.tex index f05a5d8..e77774f 100644 --- a/latex/thesis/common_thesis_header.tex +++ b/latex/thesis/common_thesis_header.tex @@ -7,6 +7,9 @@ \widowpenalty = 10000 \displaywidowpenalty = 10000 +% little higher array/table rows +\renewcommand{\arraystretch}{1.3} + % notes \newcommandx{\inlineToDoT}[2][1=]{\todo[inline,linecolor=blue,backgroundcolor=blue!25,bordercolor=blue,#1]{#2}} \newcommandx{\unsureT}[2][1=]{\todo[linecolor=red,backgroundcolor=red!25,bordercolor=red,#1]{#2}} @@ -23,11 +26,17 @@ \newcommand{\improvement}[1]{\improvementT{#1}} \newcommand{\thiswillnotshow}[1]{\thiswillnotshowT{#1}} +% various qol commands +\newcommand{\mlcell}[2][c]{% + \begin{tabular}[#1]{@{}c@{}}#2\end{tabular}} +\newcommand{\mlcellL}[2][l]{% +\begin{tabular}[#1]{@{}l@{}}#2\end{tabular}} + % custom commands \newcommand{\link}[1]{\href{#1}{#1}} % shorthands -\newcommand{\titletext}{Comparison of graph representations and traditional graph algorithms} +\newcommand{\titletext}{Comparing Graph Representation-Based Machine Learning with Traditional Graph Algorithms} \newcommand{\nk}{\textit{NetworKit}} \newcommand{\tf}{\textit{TensorFlow}} \newcommand{\ntv}{\textit{node2vec}}
\ No newline at end of file |