Help - How to use AVOBMAT

To foster research into COVID-19, we make the COVID-19 Open Research Dataset, created by the Allen Institute for Artificial Intelligence, available in a simplified beta version of AVOBMAT.

How can AVOBMAT help researchers and decision-makers?

It can mainly help with:

  • Staying up-to-date on the rapidly growing COVID-19 research
  • Text mining and analysing the articles
  • Identifying collaborators

The COVID-19 Open Research Dataset

  • Comprehensive database
  • 51,000 articles (38,000 full texts) as of 18 April
  • Papers and pre-prints from PubMed, Elsevier, MedRxiv, BioRxiv etc.
  • Growing every week
  • We do our best to upload the latest version in AVOBMAT as soon as possible.

1. Objective & background information

The objective of AVOBMAT

Critical and interactive analysis of bibliographic metadata AND texts with data-driven and NLP methods supported by AI techniques in a number of languages

Some background information

  • AVOBMAT was primarily designed for digital humanities research.
  • The COVID-19 version of AVOBMAT does not include the upload function, several preprocessing functions or lexical diversity analysis.
  • This slideshow only introduces some basic functions.
  • The release of AVOBMAT was planned for December 2020.
  • It has not been widely tested by the public.
  • It is hosted on a virtual machine with basic parameters.

2. How to search the COVID-19 database?

AVOBMAT offers the following types of search:

  • Faceted search
  • Date (range) search
  • Advanced search
  • Command line search

TIP: To start a new search you should reload the page (F5).

Faceted search

  • To narrow the search results, you can currently select only ONE item within a facet at a time. The facet then disappears.
  • Your selections are shown under Selection criteria. Click the X to remove a selection.
  • To see more items, you can use Show more.

Dates

You can specify the Date or Date range of the publications by using On, Before, After & Between.

Searching by date

Please note that the month and day of publication are not provided for every article in the original COVID-19 database.

Advanced search

  1. Select the Metadata Field (e.g. Entire document).
  2. Type your search term.
  3. Add new rows with the + sign.
  4. Connect them with AND / OR / NOT.
  5. Click Search.

Advanced search in AVOBMAT

Proximity search

Proximity search finds words that occur within a specified distance of each other.

The number (N) specifies that one word must occur within N words of the other in the documents.

Example: if the proximity value is set to 10 and you enter vaccine immunity in the Entire document field, the search returns documents in which the terms vaccine and immunity appear within 10 words of each other.

If the Order option remains ticked, only documents in which vaccine precedes immunity are returned.
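
The matching rule described above can be sketched in a few lines of Python. This is only an illustration, not AVOBMAT's implementation: the function names are hypothetical, and "within N words" is assumed to mean that the two word positions differ by at most N.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def within_proximity(text, first, second, n, ordered=True):
    """Return True if `first` and `second` occur within n words of each other.

    With ordered=True (the Order option ticked), `first` must precede `second`.
    Assumption: the distance is the difference between word positions.
    """
    tokens = tokenize(text)
    first_pos = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_pos = [i for i, t in enumerate(tokens) if t == second.lower()]
    for i in first_pos:
        for j in second_pos:
            gap = j - i
            if ordered and 0 < gap <= n:
                return True
            if not ordered and 0 < abs(gap) <= n:
                return True
    return False

# "vaccine" comes three words before "immunity", so this matches with N = 10.
print(within_proximity("The vaccine induced lasting immunity", "vaccine", "immunity", 10))
```

With the order check switched off (ordered=False), a text in which immunity comes first would also match.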

Proximity search in AVOBMAT

TIP: You can read an explanation of each function by hovering your mouse over its information icon.

Command line search (Lucene syntax)

You can formulate complex queries by using the Lucene syntax.

It includes the use of wildcards (e.g. * and ?) and regular expressions.

You can use the abbreviated versions of the metadata fields in the queries, as listed below.

Example:

(YR:[2017 TO 2020]) AND (FT:chloroquine OR FT:ivermectin) AND AB:coronavirus*

Metadata field        Abbreviation
Authors               AU
Publication Year      YR
Title                 TI
Journal               PUB
Abstract Note         AB
Entire document       FT
Detected language     DLA
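
If you build queries often, the example query above can also be assembled programmatically. The sketch below only produces the query string, which you would then paste into the command line search box. The helper names (term, year_range, any_of, all_of) are hypothetical and not part of AVOBMAT.

```python
# Field abbreviations as listed in this help page.
FIELDS = {
    "Authors": "AU",
    "Publication Year": "YR",
    "Title": "TI",
    "Journal": "PUB",
    "Abstract Note": "AB",
    "Entire document": "FT",
    "Detected language": "DLA",
}

def term(field, value):
    """Build a single field:value clause, e.g. FT:chloroquine."""
    return f"{FIELDS[field]}:{value}"

def year_range(start, end):
    """Build an inclusive range clause on the publication year."""
    return f"({FIELDS['Publication Year']}:[{start} TO {end}])"

def any_of(*clauses):
    """Join clauses with OR and wrap them in parentheses."""
    return "(" + " OR ".join(clauses) + ")"

def all_of(*clauses):
    """Join clauses with AND."""
    return " AND ".join(clauses)

query = all_of(
    year_range(2017, 2020),
    any_of(term("Entire document", "chloroquine"),
           term("Entire document", "ivermectin")),
    term("Abstract Note", "coronavirus*"),
)
print(query)
```

The printed string is identical to the example query shown above.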

Search results

  • By clicking on an article in the Search results, you can see all its metadata fields, such as the abstract, and the full text of the processed article (see the next slide), if available.
  • Using the URL, DOI etc., you can easily find the original version of the article online.
  • IMPORTANT: All content and metadata analyses and visualizations are based on the search results you selected with the search functions introduced so far.

TIP: If needed, click Auto-format to display the full text in a more reader-friendly format.

3. Keyword in context

  • Keyword in context shows your search terms in their surrounding context across all the articles in your Search results.
  • You can save time as you do not have to read the entire texts of numerous articles.
  • You can set the length of the context and the number of documents to be displayed.

TIP: You can sort the results by Publication year (ascending / descending) and Authors (alphabetical).
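
The idea behind keyword in context can be illustrated with a short Python sketch. It is a simplified stand-in under stated assumptions, not the tool's own code: the function name is hypothetical, and the context length is measured here in words on each side of the keyword.

```python
import re

def keyword_in_context(text, keyword, width=5):
    """Return each occurrence of `keyword` with up to `width` words on each side."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

sample = "Chloroquine was tested. Patients received chloroquine for five days."
for left, word, right in keyword_in_context(sample, "chloroquine", width=2):
    # Pad the left context so the keywords line up in one column.
    print(f"{left:>20} [{word}] {right}")
```

Lining up the keyword in a single column makes it easy to scan many occurrences at once.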

4. Frequency analysis & word clouds

You can perform three different types of analysis:

  • Significant text
  • TagSpheres
  • Word count

TIP: The results can also be displayed as bar charts. To do so, click “Bar chart” in the top left corner.

Significant text

The significant text analysis & visualization highlights the terms most strongly related to a specific query.

If you filter the COVID-19 database, for example by a keyword search in the Abstracts, this tool highlights the words that are most strongly related to the selected subset of documents compared with the entire COVID-19 database. In other words, it shows which words are characteristic of the subset.

You can set the following parameters:

  • 4 metric types
  • Max. number of words in the wordcloud
  • Sample size: number of documents from your subset of documents
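
To make the comparison between a subset and the whole database concrete, here is a minimal Python sketch. The scoring formula (absolute change multiplied by relative change in the share of documents containing a word) is an assumption chosen for illustration; this page does not specify the four metric types, and the function names are hypothetical.

```python
from collections import Counter

def document_share(docs):
    """Return, for each word, the share of documents that contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return {word: n / len(docs) for word, n in counts.items()}

def significant_words(subset, corpus, top=10):
    """Rank words that are more typical of `subset` than of the whole `corpus`.

    `subset` must be drawn from `corpus`, so every subset word also
    appears in the corpus.
    """
    fg = document_share(subset)   # foreground: the filtered documents
    bg = document_share(corpus)   # background: the entire collection
    # Illustrative score: absolute change times relative change in share.
    scores = {w: (fg[w] - bg[w]) * (fg[w] / bg[w]) for w in fg}
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]

corpus = [
    "chloroquine inhibits the virus",
    "the virus spreads fast",
    "the vaccine works",
    "chloroquine dose and the virus",
]
subset = [corpus[0], corpus[3]]  # documents mentioning chloroquine
print(significant_words(subset, corpus, top=3))
```

A word such as "the", which is equally common in the subset and in the whole collection, scores zero and drops out of the ranking.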

Example:

  • Search term: chloroquine in the Entire document field
  • This wordcloud shows the words most strongly related to the 493 articles mentioning chloroquine, compared with the other articles in the COVID-19 database.

Bar chart view of significant text:

TIP: you can export the data and the wordclouds.

TagSpheres

  • TagSpheres creates tag clouds showing the words that co-occur with a specified search term within a given word distance.
  • Besides the search term, you can specify the minimum frequency and the maximum distance of the co-occurring words.
  • You can also choose to analyse only the words that occur before the search term or only those that occur after it.
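
The counting behind such a tag cloud can be sketched as follows. This is an illustrative approximation only: the function and parameter names are hypothetical, and the handling of overlapping windows is an assumption.

```python
import re
from collections import Counter

def cooccurring_words(text, term, max_distance=3, min_frequency=1, side="both"):
    """Count words found within `max_distance` words of `term`.

    side: "before", "after" or "both" selects which side of the term to scan.
    Words seen fewer than `min_frequency` times are dropped.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != term.lower():
            continue
        if side in ("both", "before"):
            counts.update(tokens[max(0, i - max_distance):i])
        if side in ("both", "after"):
            counts.update(tokens[i + 1:i + 1 + max_distance])
    return {w: c for w, c in counts.items() if c >= min_frequency}

text = "chloroquine reduces viral load and chloroquine reduces fever"
print(cooccurring_words(text, "chloroquine", max_distance=2, min_frequency=2, side="after"))
```

Raising the minimum frequency removes words that appear next to the search term only by chance.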

Example:

This wordcloud shows the words occurring within three words of the term chloroquine.

TIP: you can switch to “Bar chart” and export the results.

N.B. In future releases stopwords (e.g. the, and) will be removed.

Word count

This visualization shows the most frequent words in your filtered documents.

Example:

Most frequent words in the 493 articles mentioning the word chloroquine.

5. Ngram viewer

  • The Ngram Viewer shows the yearly count of the specified ngrams.
  • An ngram is a sequence of N words. Example: social distancing is a 2-gram (bigram).
  • In the AVOBMAT-COVID-19 database N is limited to 3 and all the words are lowercased.
  • The Ngram Viewer also provides a normalized view, where the count of each ngram is divided by the count of all words in the given year.
  • The COVID-19 Open Research Dataset in AVOBMAT contains the most recent articles on the new coronavirus AND related historical research on other coronavirus outbreaks and epidemics.
  • The Ngram Viewer gives you a diachronic (chronological) overview of the distribution of your search terms in the scholarly literature.
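
The difference between the aggregated and the normalized view can be shown with a small Python sketch. It assumes whitespace tokenization and hypothetical function names, and it is not the tool's own implementation.

```python
from collections import Counter

MAX_N = 3  # the help page states that N is limited to 3

def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def yearly_ngram_counts(docs, phrase):
    """docs: iterable of (year, text) pairs.

    Returns {year: (count, share)}, where share is the count divided by
    the number of all words published in that year (the normalized view).
    """
    phrase = phrase.lower()
    n = len(phrase.split())
    if not 1 <= n <= MAX_N:
        raise ValueError("phrase must contain 1 to 3 words")
    hits = Counter()
    totals = Counter()
    for year, text in docs:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += ngrams(tokens, n).count(phrase)
    return {y: (hits[y], hits[y] / totals[y]) for y in totals if totals[y]}

docs = [
    (2019, "social distancing works"),
    (2020, "social distancing and more social distancing"),
    (2020, "masks help"),
]
print(yearly_ngram_counts(docs, "social distancing"))
```

Normalizing matters because far more articles were published in some years than in others, so raw counts alone can be misleading.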

Example:

TIP: If you hover your mouse over a particular point on a curve, it displays the yearly count of your search term, or the percentage in the normalized view.

Example of normalized view:

TIP: You can export the image and the data, and switch back to the aggregated view:

6. Topic modelling

  • Cluster documents in semantic groups
  • Find hidden semantic information
  • Statistical methods are used to discover the themes embedded in the texts and to reveal the connections between these themes and their changes over time.
  • A useful tool for the COVID-19 database because, for example, the articles do not come with keywords or with manual or automatic tags.

Parameters

The parameters include:

  • Minimum frequency of words: the most frequent and the rarest words are of no real use for topic modelling.
  • The number of topics: how many semantic groups should be formed.
  • The number of iterations: how long the model should learn from the data. Press RUN to start the analysis.
  • The ideal number of topics depends on what you are looking for in the model.
  • For example, if you set the number of topics to 10, it gives a broad overview of the articles selected by the different search functions in the Home menu.

 

Example:

Search query:

  • “malaria” in Abstracts
  • language: English
  • Articles with full text
  • Result: 183 articles

Topic modelling parameters:

  • Number of topics: 20
  • Iterations: 200

Here are the results in the form of 20 topic clusters:

[0] antibody cell protein parasite sequence blood bind red igg gene
[1] sample method dna pcr time parasite positive detection lamp detect
[2] protein serum figure analysis control malaria expression perform identify table
[3] patient malaria infection day diagnosis treatment severe fever test blood
[4] vaccine disease infection case country cause death vaccination health immunization
[5] disease water increase human change mosquito climate affect cholera cause
[6] china health africa aids international global hiv development program chinese
[7] clinical blood product development study trial use technology potential need
[8] health public disease country surveillance system datum information control include
[9] vaccine response cell antigen vector immune adjuvant protein induce specific
[10] have stillbirth road ebola network municipality dengue node community epidemic
[11] malaria case report control area study datum high population numb
[12] compound acid activity hcv fig peptide enzyme active structure amino
[13] cell virus chloroquine infection treatment viral effect host infect drug
[14] disease study research infectious identify pathogen analysis country laboratory publish
[15] child risk high age low year associate fever rate health
[16] mouse fig lung parasite level study show day live animal
[17] model disease numb human population infect individual infection transmission mosquito
[18] travel infection traveler fever case return disease traveller include dengue
[19] species host include occur range common parasite infection tissue blood

Please note that the words in the original articles are LEMMATIZED (reduced to their dictionary form). Example: patients becomes patient.

Interpretations of some topic clusters:

[1] sample method dna pcr time parasite positive detection lamp detect

This topic refers to methods such as PCR and LAMP to detect malaria by DNA amplification.

[3] patient malaria infection day diagnosis treatment severe fever test blood

Articles in this cluster are related to the diagnosis and treatment of malaria in patients with fever.

 

You can see the list of articles belonging to each topic cluster by clicking on Topic documents.

The percentage in square brackets shows the probability of the selected topic (bold) in the article.

In this example several articles are related to the LAMP detection method.

 

Option: removing unnecessary words

Very frequent words are of no use for topic models. You can remove them interactively by clicking on the Vocabulary icon.

TIP: Do NOT forget to re-run the iterations after removing the stopwords.

Topic correlations

Topics that occur together more often than expected are shown in blue; topics that occur together less often than expected are shown in red.
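
One common way to express "more than expected" is pointwise mutual information (PMI), sketched below. Whether AVOBMAT uses this exact measure is not stated here, so treat it as an assumption for illustration. The function name, the input format and the 0.1 probability threshold are all hypothetical.

```python
import math

def topic_pmi(doc_topics, a, b, threshold=0.1):
    """Compare how often topics a and b share a document with chance.

    doc_topics: list of {topic_id: probability} dicts, one per document.
    A topic counts as present if its probability reaches `threshold`.
    Returns a positive value if the topics co-occur more than expected,
    a negative value if less, and None if there is nothing to compare.
    """
    n = len(doc_topics)
    has_a = sum(1 for d in doc_topics if d.get(a, 0) >= threshold)
    has_b = sum(1 for d in doc_topics if d.get(b, 0) >= threshold)
    both = sum(1 for d in doc_topics
               if d.get(a, 0) >= threshold and d.get(b, 0) >= threshold)
    if not (has_a and has_b and both):
        return None
    expected = (has_a / n) * (has_b / n)  # co-occurrence under independence
    return math.log((both / n) / expected)

docs = [
    {0: 0.6, 1: 0.4},
    {0: 0.5, 1: 0.5},
    {0: 0.5, 2: 0.5},
    {2: 1.0},
    {2: 1.0},
    {1: 1.0},
]
print(topic_pmi(docs, 0, 1))  # positive: together more than expected
print(topic_pmi(docs, 0, 2))  # negative: together less than expected
```

In a correlation view, positive values would correspond to the blue cells and negative values to the red cells.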

Time series

You can visualize the distribution of topics over time.

Aggregated:

Tip: You can interactively remove a topic by clicking on its colour.

 

Normalized:

You can display the results in normalized mode.

Tip: If you click on a point on a curve, the topic and its probability value are displayed.

7. Metadata analysis & visualizations

Interactive metadata visualizations

  • Metadata includes authors, journal titles and publication years.
  • Having filtered the COVID-19 database with the different searches (Home menu), users can, among other things:
  • Analyse and visualize the bibliographic data chronologically in line and area charts, in normalized and aggregated formats;
  • Create an interactive network analysis of up to three (meta)data fields;
  • Make pie, horizontal and vertical bar charts of the bibliographic data of their choice.
  • Users can choose the metadata field(s) and the number of top items for visualization.

Click on “Metadata visualization” in the menu bar.

Example:

Search term: chloroquine in the Abstracts

Network of chloroquine-related publications:

TIP: You can download the network and the data.

 

Top five authors in the COVID-19 dataset in 2020 and the journals where they published their papers:

N.B. Missing values in the Authors and Journal fields were excluded.

You have reached the end of the AVOBMAT help page. If you have any more questions, don’t hesitate to contact us.