Topic Modeling

Identifying regional characteristics of transportation research with Transport Research International Documentation (TRID) data

Keywords:  Data mining, Topic modeling, Unsupervised Machine Learning

Github: https://github.com/skkirtonia/topic_modeling

Article link: Click here to read the article

Python libraries used:

Data analysis: Pandas, Numpy

Plotting: Seaborn, Matplotlib, wordcloud, networkx

Topic modeling: scikit-learn

XML parsing: xml (ElementTree)

Network request: urllib, BeautifulSoup (for parsing)

Natural Language Processing: spaCy, nltk, langdetect, pycountry

Objective

Transport Research International Documentation (TRID) is the world’s largest transportation research bibliographic database. It contains more than 1.2 million records of references to books, project reports, and journal and conference papers. This research aims to identify the regional characteristics of transportation research using the bibliographic information of published research articles and papers.


Example research questions: 




Data collection:

Bibliographic information for articles and papers published between 2008 and 2018 was collected from https://trid.trb.org/.

Total number of documents: 257,225

Attributes: Abstract, Conference, Conference Location, EISSN, Geographic Term, ISSN, Index Term, Issue, Language, Number of Authors, Publication Year, Published On, Publisher, Record ID, Subject Area, Title, Volume, Type.


Data cleaning and transformation:

Python libraries used: xml (ElementTree), pandas

Source code:  Github
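The XML parsing step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the tag names (record, title, abstract, year) and the sample document are hypothetical, and the real TRID export schema may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real TRID export may use different tags.
SAMPLE_XML = """
<records>
  <record id="1001">
    <title>Pavement performance under heavy traffic</title>
    <abstract>This study evaluates asphalt pavement performance.</abstract>
    <year>2015</year>
  </record>
  <record id="1002">
    <title>Driver behavior at signalized intersections</title>
    <year>2012</year>
  </record>
</records>
"""

def parse_records(xml_text):
    """Turn each <record> element into a flat dict; missing fields become None."""
    root = ET.fromstring(xml_text)
    rows = []
    for rec in root.iter("record"):
        year = rec.findtext("year")
        rows.append({
            "record_id": rec.get("id"),
            "title": rec.findtext("title"),
            "abstract": rec.findtext("abstract"),
            "year": int(year) if year is not None else None,
        })
    return rows

records = parse_records(SAMPLE_XML)
# The list of dicts can be passed directly to pandas.DataFrame(records).
```

Keeping missing fields as None makes it straightforward to filter out records without an abstract in a later step.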

Python libraries used: urllib, BeautifulSoup

Source code: Github

a. The abstract is missing or shorter than 100 characters.

b. The journal is not indexed in SCI, SSCI, or EI.

c. The abstract is written in a language other than English.

Python libraries used: langdetect, spaCy

Source code: Github
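The three conditions above can be expressed as one predicate. The sketch below is an illustration under assumptions: the field names (abstract, journal_index) are hypothetical, and the language detector is passed in as a function so the example runs without extra dependencies. In practice, the detect function from langdetect, which returns codes such as "en", could be supplied instead of the stand-in used here.

```python
MIN_ABSTRACT_CHARS = 100
ACCEPTED_INDEXES = {"SCI", "SSCI", "EI"}

def keep_document(doc, detect_language):
    """Return True only if the document passes all three filters."""
    abstract = doc.get("abstract") or ""
    if len(abstract) < MIN_ABSTRACT_CHARS:            # (a) missing or too short
        return False
    if doc.get("journal_index") not in ACCEPTED_INDEXES:  # (b) journal not indexed
        return False
    return detect_language(abstract) == "en"          # (c) English only

def fake_detect(text):
    # Stand-in for a real language detector, for demonstration only.
    return "fr" if text.startswith("Cette") else "en"

# Hypothetical documents
docs = [
    {"id": "A", "abstract": "The study " * 12, "journal_index": "SCI"},
    {"id": "B", "abstract": "Too short", "journal_index": "SCI"},
    {"id": "C", "abstract": "The study " * 12, "journal_index": None},
    {"id": "D", "abstract": "Cette étude " * 12, "journal_index": "SCI"},
]

kept = [d["id"] for d in docs if keep_document(d, fake_detect)]
```

Only document A survives here: B fails the length check, C fails the index check, and D fails the language check.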

a. Removing words other than nouns, adjectives, verbs, adverbs, and proper nouns.

b. Lemmatizing each word, i.e., dropping unnecessary characters (usually a suffix) to keep only the base dictionary form, e.g., "walking" becomes "walk".

c. Standardizing the names of countries and U.S. states.

d. Detecting any geographic information in the abstract.

e. Removing words that appear in more than 60% of the abstracts or in fewer than 50 abstracts.

Python libraries used: spaCy, pycountry, scikit-learn, nltk

Source code: Github

Some statistics with cleaned data

The average number of publications each year is around 14,000

The figure shows the number of publications for the top 30 countries. The number of papers associated with the U.S. is very large relative to the countries that follow, such as China, Australia, and Canada.

The figure shows the distribution of papers over the top 30 journals.

The journal with the most papers is Transportation Research Record (TRR).

Although TRR has the largest number of papers, TRB, the publisher of TRR, ranks third among all publishers by number of papers.

Two other publishers, Elsevier and the American Society of Civil Engineers (ASCE), lead the ranking.

Model building

Latent Dirichlet Allocation (LDA): LDA is one of the most popular methods for topic modeling. It aims to find the topics a document belongs to based on the words the document contains, assuming that documents with similar topics use similar groups of words. Each document is represented as a probability distribution over latent topics, and each topic is a probability distribution over words.

Number of topics: 50 (the number is chosen based on other transportation bibliographic analyses with a comparable scale in the literature)


Model input: A 146,972 × 2,193 matrix in which each row represents a document and each column represents a unique word.
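A minimal sketch of fitting LDA on such a matrix with scikit-learn is shown below. The random count matrix is a toy stand-in for the real input, and the number of topics is reduced to 5 so the example runs quickly; all other settings are scikit-learn defaults and are not taken from the project.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in for the document-word count matrix (200 documents x 30 words).
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(200, 30))

# The study uses 50 topics; 5 are used here to keep the toy example fast.
lda = LatentDirichletAllocation(n_components=5, random_state=0)

doc_topic = lda.fit_transform(counts)  # one row of topic probabilities per document
topic_word = lda.components_           # one row of word weights per topic
```

Each row of doc_topic sums to 1, while the rows of components_ are unnormalized weights that can be divided by their row sums to obtain word probabilities.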



Example input matrix

Model output

The distribution of words in each topic: for each topic, a weight is obtained for every word. The words with the highest weights characterize the topic.

The distribution of topics in each document: for each document, we get a probability distribution over all topics. The topic with the highest probability is taken as the topic of the document.
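The two outputs can be read with a few NumPy operations. The matrices and vocabulary below are hypothetical and much smaller than the real ones; taking the column mean as the overall topic share is one possible choice, not necessarily the one used in the project.

```python
import numpy as np

vocab = np.array(["pavement", "asphalt", "driver", "crash", "freight"])

# Hypothetical topic-word weights (2 topics x 5 words).
topic_word = np.array([
    [8.0, 6.0, 0.5, 0.2, 0.3],
    [0.3, 0.1, 7.0, 5.0, 0.6],
])

# Hypothetical document-topic probabilities (3 documents x 2 topics).
doc_topic = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
])

def top_words(topic_word, vocab, n=2):
    """Return the n highest-weighted words for each topic."""
    order = np.argsort(topic_word, axis=1)[:, ::-1][:, :n]
    return [vocab[row].tolist() for row in order]

words_per_topic = top_words(topic_word, vocab)
dominant_topic = doc_topic.argmax(axis=1)  # topic with the highest probability
topic_share = doc_topic.mean(axis=0)       # average share of each topic overall
```

Here the first topic is characterized by "pavement" and "asphalt", the second by "driver" and "crash", and the documents are assigned to topics 0, 1, and 0 respectively.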

LDA model output

Results

Topic-word distribution

All 50 topics are manually inspected and given appropriate names. Some topics do not represent transportation research themes; they consist of words that usually appear together in academic writing. These topics (27, 34, and 49) are named 'academic words'.

Word cloud for topics 0–24

Word cloud for topics 25-49


Distinct topics vs. overlapping topics

There is no overlap between the two groups of major words for T40 Soil and pile foundation and T47 Asphalt.


Some keywords overlap between T0 Driving simulation and T42 Driving behavior. The sensitivity analyses show that when the number of topics is reduced from 50 to 46, such topics with a substantial overlap are merged.
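One simple way to quantify such overlap is to compare the sets of top keywords of two topics. The sketch below uses the Jaccard index, which is a choice made here for illustration and not necessarily the measure used in the study; the keyword lists are hypothetical placeholders, not the actual topic words.

```python
def keyword_overlap(words_a, words_b):
    """Return the shared keywords and the Jaccard index of two keyword lists."""
    a, b = set(words_a), set(words_b)
    union = a | b
    shared = a & b
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard

# Hypothetical top keywords for illustration only.
t40 = ["soil", "pile", "foundation", "load"]
t47 = ["asphalt", "mixture", "binder", "temperature"]
t0 = ["driver", "simulator", "driving", "speed"]
t42 = ["driver", "behavior", "driving", "risk"]

distinct = keyword_overlap(t40, t47)     # no shared keywords
overlapping = keyword_overlap(t0, t42)   # two shared keywords
```

A Jaccard index of 0 indicates fully distinct keyword sets, while values closer to 1 indicate topics that are candidates for merging when the number of topics is reduced.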

Research topic proportion among various studies

This figure shows the proportion of topics in all documents. 

Topics 27, 34, and 49 are among the top topics. They consist of academic words, which tend to appear together and therefore form their own clusters.

Topic 37 (Transportation sustainability) is found to be a popular topic.

Research topic distributions for selected countries

Panel (a) shows the topic distributions for China and Australia. Australia is clearly higher in T37 Transportation sustainability and T42 Driving behavior, while China is higher in T5 Optimization model and T8 Freight port.

Panel (b) shows three European countries with similar topic distributions.

Panel (c) demonstrates the strong similarity between Australia and New Zealand.

Countries with similar research topics

The figure shows the dendrogram for 30 countries/regions based on the Euclidean distance between topic distributions. Closely connected countries have similar research topics. 


A few notable country groups are observed, such as (United States, Canada), (Japan, South Korea), (Norway, Finland), (Australia, New Zealand), and (Switzerland, Netherlands).


We find that countries/regions that are geographically proximate are associated with very similar research topic distributions.
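A dendrogram of this kind can be built with SciPy's hierarchical clustering on the per-region topic distributions. The sketch below uses hypothetical regions and distributions; the Euclidean metric follows the description above, while the average linkage method is an assumption, since the linkage criterion is not stated.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical regions and topic distributions (each row sums to 1).
labels = ["Region A", "Region B", "Region C", "Region D"]
topic_dist = np.array([
    [0.70, 0.20, 0.10],
    [0.68, 0.22, 0.10],
    [0.10, 0.30, 0.60],
    [0.15, 0.25, 0.60],
])

# Agglomerative clustering on Euclidean distances between distributions.
# "average" linkage is an assumption made for this sketch.
Z = linkage(topic_dist, method="average", metric="euclidean")

# no_plot=True returns the tree layout without requiring matplotlib.
tree = dendrogram(Z, labels=labels, no_plot=True)

# Cut the tree into two groups of similar regions.
clusters = fcluster(Z, t=2, criterion="maxclust")
```

Regions A and B merge first because their distributions are closest, followed by C and D; dropping no_plot=True draws the tree with matplotlib.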

US States with similar research topics

The figure shows the dendrogram for 50 states based on the Euclidean distance between topic distributions. Closely connected states have similar research topics.

 

Virginia and Texas, despite being geographically remote, are associated with similar topic distributions.


California and New Jersey share quite similar topic distributions.


By contrast, two states bordered by the Gulf of Mexico, namely Florida and Louisiana, have quite different topic distributions.

EV-related research

The blue bar indicates the expected number of studies based on the proportion of EV-related research in all documents. The orange bar shows the actual number of studies found related to EV for the corresponding region.
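The expected number for a region can be computed as the region's total number of documents multiplied by the overall share of EV-related documents. The sketch below illustrates this calculation with hypothetical counts; the region names and numbers are placeholders, and the overall share is computed over the listed regions only.

```python
def expected_counts(total_docs_by_region, ev_docs_by_region):
    """Expected EV documents per region if every region had the overall EV share."""
    total_docs = sum(total_docs_by_region.values())
    total_ev = sum(ev_docs_by_region.values())
    ev_share = total_ev / total_docs
    return {region: n * ev_share for region, n in total_docs_by_region.items()}

# Hypothetical counts for illustration only.
total_docs = {"Region A": 1000, "Region B": 500, "Region C": 500}
ev_docs = {"Region A": 10, "Region B": 25, "Region C": 5}

expected = expected_counts(total_docs, ev_docs)

# Ratio above 1: more EV studies than expected; below 1: fewer than expected.
ratio = {region: ev_docs[region] / expected[region] for region in total_docs}
```

With these numbers the overall EV share is 2%, so Region B, with 25 actual documents against 10 expected, stands out, while Regions A and C fall short.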


Germany is much more likely to be a study region than France for EV-related research.

The U.S. is not as likely as China to be a study region.

At the U.S. state level, California, New York, and Vermont stand out as the actual number of EV-related documents far exceeds the expected number for them.


Florida falls short of expectations.

Conclusion

Summary of the findings