Mining knowledge from inter-connectedness of documents
Text is the most concise and ubiquitous way to communicate information, and arguably the most useful representation of our physical world. It is no surprise that much of artificial intelligence, including the current spate of innovations, relies on natural language understanding (via prompting). For machines to understand language, text needs an accurate numerical representation, and that representation has evolved considerably over the last decade. From one-hot representations, to static word embeddings trained on relatively small corpora, to vector representations of sub-word tokens learned from internet-scale textual data (via the attention mechanism), this change has been one of the most consequential for the AI revolution that we are currently a part of.
This innovation in textual representation has made it possible to store text documents as vectors, on which mathematical operations can be applied to explore and exploit their inter-relatedness. For example, vector search engines use cosine similarity to find the documents most relevant to a query. Similarity alone drives many useful applications, including search, information retrieval and recommendations. The use of vector similarity of text, however, remains under-discussed in network analysis.
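As a minimal illustration of the similarity search mentioned above, the sketch below ranks a few documents against a query by cosine similarity. The three-dimensional vectors and the document names are made up for the example; real embeddings would come from a model and have hundreds of dimensions. It is kept dependency-free on purpose, whereas a real system would likely use a numerical library or a vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, doc_vecs, top_k=2):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy embeddings (hypothetical values, for illustration only).
docs = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.0, 1.0, 0.2],
    "paper_c": [0.7, 0.3, 0.1],
}
results = search([1.0, 0.0, 0.0], docs)
```

Here `results` lists `paper_a` first and `paper_c` second, because their vectors point in nearly the same direction as the query, while `paper_b` is almost orthogonal to it.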
Review of the existing work
Most previous work on semantic networks uses words, rather than entire documents, as the unit of analysis. In Semantic Network Analysis for Social Sciences, Elad Segev writes,
“Semantic network analysis is more of an unsupervised approach. It is similar to topic modelling, but instead of classifying documents based on common words, it classifies words within a document based on their proximity or co-occurrences. The input in this case is a text, and the output is a network of words, clustered based on their proximity. Semantic network analysis is considered to be an unsupervised approach because there is no need to define in advance the categories, and the classification of words into categories is done automatically based on their co-occurrences within the text.”
Other work in the area includes “A network approach to topic models” (by Martin Gerlach, Tiago P. Peixoto and Eduardo G. Altmann), which details what it calls the cross-fertilization between topic models and community detection (used in network analysis). This paper is closer to what is described here but uses an entirely different approach wherein “the nodes consists of documents and words, and the strength of the edge between them is given by the number of occurrences of the word in the document, yielding a bipartite multigraph that is equivalent to the word-document matrix used in topic models.”
Similarity networks are not a novel idea per se. In the book Complex Network Analysis in Python, Dmitry Zinoviev details the subject, wherein similarity measures between nodes are used to form edges in the graph. This article applies a similar approach to the text representation of our items of interest. These items are the nodes of the network: they can be write-ups for recommendation on a content site, research papers for a literature survey, or products on an e-commerce site (via their product descriptions). The inter-connectedness between them is derived solely from the vector representation of their underlying textual descriptions.
This approach differs from conventional text clustering in that the latter loses information about how each document is connected to the larger document space. In network analysis, this information is retained via the connections between the nodes.
Some potential applications of this approach include:
- Detection of white spaces in the whole document space: say there is a collection of research papers on a given subject and you would like to explore ideas that remain understudied. You could look at the network to identify the nodes that have a low degree within their clusters. These peripheral nodes could point you to potential white spaces in the literature.
- Identifying documents that are integral to bridging two concepts: say that, via community detection on the semantic similarity network, you have identified clusters that represent certain categories or concepts. By looking at the interconnections between two clusters of interest, you can pick the nodes (documents) that have the most edges between these two clusters; these are potential candidates for the ‘bridge’ concepts. (This can be especially useful in literature surveys, where the use of networks is often limited to treating references as edges. It may happen that certain papers receive more references for reasons not entirely related to the content of the paper. See Huber, Juergen and M. Inoua, Sabiou and Kerschbamer, Rudolf and König-Kersting, Christian and Palan, Stefan and Smith, Vernon L., Nobel and Novice: Author Prominence Affects Peer Review (August 16, 2022).)
- Obviating the cold start problem in recommendations: recommendations typically require user engagement data, which is not available when you are starting out. Moreover, serving recommendations solely on similarity can lead to unintended consequences (see, e.g., How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility (2017) by Allison J. B. Chaney, et al.). Semantic similarity networks can be used to add diversity to your recommendations. Say the user clicks on a certain article. From the network, the following pseudo-code can be used:
- Identify the set of clusters C which the node (document) is connected to.
- Use a threshold on the percentage of the node's edges to filter out the most relevant clusters (C_filtered). Example: the node has 4 edges to cluster 1 (of which it is a part), 3 edges to cluster 2, 2 edges to cluster 3 and 1 edge to cluster 4. With a threshold of 90%, clusters 1, 2 and 3 account for 9 of the 10 edges and become C_filtered.
- A predefined number of recommendations are served from the most similar nodes, i.e. the direct neighbours of the node of interest, which may belong to any cluster.
- For the remaining recommendations, nodes are selected from C_filtered that have a high number of edges with the cluster of the node of interest but are not directly connected to the node itself. Let this set be S.
- To pick the top k nodes from S, maximal marginal relevance can be used to identify nodes that are diverse among themselves but semantically related to the cluster that our node of interest is a part of.
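The recommendation steps above can be sketched in plain Python as follows. This is one possible reading under stated assumptions, not a reference implementation: the graph is a dictionary mapping each node to the set of its neighbours, `cluster_of` maps nodes to cluster ids, and `sim` is any pairwise similarity function. All function and parameter names (`filter_clusters`, `mmr`, `recommend`, `coverage`, `min_edges`, `lam`) are hypothetical, and using the mean similarity to the home cluster as the relevance term of maximal marginal relevance is a choice made for this example.

```python
from collections import Counter

def filter_clusters(node, adj, cluster_of, coverage=0.9):
    """Steps 1-2: clusters that together account for `coverage` of the node's edges."""
    counts = Counter(cluster_of[nb] for nb in adj[node])
    total = sum(counts.values())
    kept, covered = [], 0
    for cluster, n_edges in counts.most_common():  # largest share first
        if covered / total >= coverage:
            break
        kept.append(cluster)
        covered += n_edges
    return kept

def mmr(candidates, relevance, sim, k, lam=0.5):
    """Step 5: greedy maximal marginal relevance (relevance versus redundancy)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

def recommend(node, adj, cluster_of, sim, n_similar=2, n_diverse=2,
              coverage=0.9, min_edges=1, lam=0.5):
    """Steps 3-5: similar neighbours first, then diverse non-neighbours."""
    home = cluster_of[node]
    home_members = [m for m, c in cluster_of.items() if c == home]

    # Step 3: the most similar direct neighbours, whatever their cluster.
    similar = sorted(adj[node], key=lambda nb: sim(node, nb), reverse=True)[:n_similar]

    # Step 4: candidate set S = nodes in the filtered clusters that are not
    # neighbours of `node` but have enough edges into its home cluster.
    kept = set(filter_clusters(node, adj, cluster_of, coverage))

    def edges_into_home(n):
        return sum(1 for nb in adj.get(n, ()) if cluster_of[nb] == home)

    S = [n for n in cluster_of
         if n != node and n not in adj[node] and cluster_of[n] in kept
         and edges_into_home(n) >= min_edges]

    # Step 5: relevance = mean similarity to the home cluster (an assumption).
    def mean_sim(c):
        others = [m for m in home_members if m != c]
        return sum(sim(c, m) for m in others) / len(others)

    relevance = {c: mean_sim(c) for c in S}
    return similar + mmr(S, relevance, sim, n_diverse, lam)
```

With the worked example from the list (4, 3, 2 and 1 edges to clusters 1 to 4 and a 90% threshold), `filter_clusters` keeps clusters 1, 2 and 3. The `lam` parameter controls the trade-off in the last step: values close to 1 favour relevance, values close to 0 favour diversity.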
The main challenges in using semantic similarity networks are detailed below:
- High-quality document embeddings that are adapted to the domain
- Choosing a suitable threshold of similarity for formation of edges
- Choosing a community detection algorithm catering to similarity networks
The third needs experimentation. In my experiments, conventional community detection algorithms such as greedy modularity did not perform as well as agglomerative clustering with Euclidean distance. The resulting clusters were used to calculate average distance scores, which were in turn averaged into a single threshold for the edges of the network. (This threshold can be considered a hyperparameter.) Adjusting the threshold alters the sparsity of the network.
As for document embeddings, the field is evolving rapidly, and there is active research into more precise domain-adapted embeddings. For my experiment (detailed below), I used a set of research papers, for which the most relevant pre-trained transformer is ‘allenai/specter’; its embeddings did give decent results. Another of my go-to methods for domain-adapted embeddings is the Transformer-based Sequential Denoising Auto-Encoder (TSDAE) by Wang, Reimers and Gurevych. Its architecture introduces a bottleneck before the decoder, which yields a precise document embedding (as opposed to word embeddings). It can be trained on an unlabelled dataset, followed by training on a labelled dataset (preferably related to the domain of interest).
A dataset of arXiv papers in the cs.CL category was taken for the years 2018, 2019 and 2020. (These years also saw a phenomenal rise in natural language processing, following the use of self-attention for training large language models.) As mentioned before, the document embeddings were obtained from the ‘allenai/specter’ model without any fine-tuning. These embeddings were fed into sklearn’s agglomerative clustering, which produced 101 clusters. (The code is given below.)
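To make the thresholding step concrete, here is a small sketch under one possible interpretation: the mean pairwise Euclidean distance is computed inside each cluster, those per-cluster means are averaged into a single threshold, and two documents are connected when their distance does not exceed it. The function names and the toy 2-D vectors are hypothetical, and the cluster labels are assumed to come from a prior clustering step. The code uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from collections import defaultdict
from itertools import combinations

def mean_pairwise_distance(vectors):
    """Mean Euclidean distance over all pairs; None for fewer than two vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return None
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def edge_threshold(embeddings, labels):
    """Average the per-cluster mean distances into a single threshold."""
    clusters = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        clusters[label].append(vec)
    means = [m for m in map(mean_pairwise_distance, clusters.values())
             if m is not None]  # singleton clusters carry no distance information
    return sum(means) / len(means)

def build_edges(embeddings, threshold):
    """Connect every pair of documents whose distance is at most the threshold."""
    return [(i, j) for i, j in combinations(range(len(embeddings)), 2)
            if math.dist(embeddings[i], embeddings[j]) <= threshold]

# Toy 2-D embeddings and cluster labels (hypothetical values).
embeddings = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
labels = [0, 0, 1, 1]
threshold = edge_threshold(embeddings, labels)
edges = build_edges(embeddings, threshold)
```

In this toy case the two cluster means are 1.0 and 2.0, so the threshold is 1.5 and only the first pair of documents is connected. Raising the threshold adds edges and makes the network denser; lowering it makes the network sparser.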
The clusters were then queried. For example, cluster 3 contained, among others, the following papers:
- A New Dataset for Natural Language Inference from Code-mixed Conversations
- Modeling Interpersonal Influence of Verbal Behavior in Couples Therapy Dyadic Interactions
- Multi-label Multi-task Deep Learning for Behavioral Coding
- Supervised Seeded Iterated Learning for Interactive Language Learning
- Characterizing the dynamics of learning in repeated reference games
- Predicting Behavior in Cancer-Afflicted Patient and Spouse Interactions using Speech and Language
- Alquist 3.0: Alexa Prize Bot Using Conversational Knowledge Graph
- Prevalence of code mixing in semi-formal patient communication in low resource languages of South Africa
Similarly, cluster 10 contained, among others, the following papers:
- Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
- Revisiting Semantic Representation and Tree Search for Similar Question Retrieval
- Lattice CNNs for Matching Based Chinese Question Answering
- RuBQ: A Russian Dataset for Question Answering over Wikidata
- Quizbowl: The Case for Incremental Question Answering
- F1 is Not Enough! Models and Evaluation Towards User-Centered Explainable Question Answering
- Meta Sequence Learning for Generating Adequate Question-Answer Pairs
- Subjective Question Answering: Deciphering the inner workings of Transformers in the realm of subjectivity
- A Quantitative Evaluation of Natural Language Question Interpretation for Question Answering Systems
A cursory glance tells us that cluster 3 is about conversational NLP and dialogue systems, and cluster 10 is about question answering. To find the peripheral papers in cluster 3, the graph is queried and the following results are obtained:
- Symbol Emergence as an Interpersonal Multimodal Categorization
- Prevalence of code mixing in semi-formal patient communication in low resource languages of South Africa
- A Dialogue Annotation Scheme for Weight Management Chat using the Trans-Theoretical Model of Health Behavior Change
- Helping or Hurting? Predicting Changes in Users’ Risk of Self-Harm Through Online Community Interactions
- Does Social Support Expressed in Post Titles Elicit Comments in Online Substance Use Recovery Forums?
These look like niche, specialized topics. To find the “bridge papers” between clusters 3 and 10, the graph is queried again and the following result is obtained:
- Transforming Question Answering Datasets Into Natural Language Inference Datasets
The result is consistent with expectation, as the paper is about converting question-answer pairs into declarative forms.
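The two graph queries used above can be sketched as follows. This is an illustrative, dependency-free version, not the code used for the experiment: peripheral papers are taken to be the lowest-degree members of a cluster, and bridge papers the nodes with the most edges crossing into the other cluster. The function names and the toy graph are hypothetical; a graph library such as NetworkX could express the same queries.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency map: node -> set of neighbouring nodes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def peripheral_nodes(adj, cluster_of, cluster, k=1):
    """The k members of `cluster` with the lowest degree (ties broken by name)."""
    members = [n for n, c in cluster_of.items() if c == cluster]
    return sorted(members, key=lambda n: (len(adj.get(n, ())), n))[:k]

def bridge_nodes(adj, cluster_of, cluster_a, cluster_b, k=1):
    """The k nodes of either cluster with the most edges into the other one."""
    other = {cluster_a: cluster_b, cluster_b: cluster_a}
    crossing = {}
    for node, c in cluster_of.items():
        if c in other:
            n_cross = sum(1 for nb in adj.get(node, ())
                          if cluster_of[nb] == other[c])
            if n_cross:
                crossing[node] = n_cross
    return sorted(crossing, key=lambda n: (-crossing[n], n))[:k]

# Toy graph (hypothetical): papers a-d in cluster 3, papers x-z in cluster 10.
cluster_of = {"a": 3, "b": 3, "c": 3, "d": 3, "x": 10, "y": 10, "z": 10}
adj = build_adjacency([("a", "b"), ("a", "c"), ("b", "c"), ("a", "d"),
                       ("c", "x"), ("c", "y"), ("x", "y"), ("y", "z"), ("x", "z")])
peripheral = peripheral_nodes(adj, cluster_of, 3)
bridges = bridge_nodes(adj, cluster_of, 3, 10)
```

In the toy graph, `d` is the peripheral paper of cluster 3 because it has a single edge, and `c` is the bridge because it is the only node with two edges into cluster 10.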
(The experiments with RecSys will be shared separately.)
Semantic similarity networks do offer a different way of analyzing and querying our datasets. It would be interesting to explore this with further experiments.