Topic Modeling
Identifying regional characteristics of transportation research with Transport Research International Documentation (TRID) data
Keywords: Data mining, Topic modeling, Unsupervised Machine Learning
Github: https://github.com/skkirtonia/topic_modeling
Article link: Click here to read the article
Python libraries used for:
Data analysis: Pandas, Numpy
Plotting: Seaborn, Matplotlib, wordcloud, networkx
Topic modeling: scikit-learn
XML parsing: xml (ElementTree)
Network request: urllib, BeautifulSoup (for parsing)
Natural Language Processing: Spacy, nltk, langdetect, pycountry
Objective
Transport Research International Documentation (TRID) is the world’s largest transportation research bibliographic database. More than 1.2 million records of references to books, project reports, journal and conference papers are contained in TRID. This research aims to identify the regional characteristics of transportation research using the bibliographic information of the published research articles and papers.
Example research questions:
Do transportation issues and relevant studies in various geographic regions differ significantly?
What are some trending research topics investigated in the United States, China, or Australia over the past seven years?
Do the research topics in transportation show similarities among states that are geographically close in the USA?
Data collection:
Bibliographic information for articles and papers published between 2008 and 2018 was collected from https://trid.trb.org/.
Total number of documents: 257,225
Attributes: Abstract, Conference, Conference Location, EISSN, Geographic Term, ISSN, Index Term, Issue, Language, Number of Authors, Publication Year, Published On, Publisher, Record ID, Subject Area, Title, Volume, Type.
Data cleaning and transformation:
Extract information from XML: The data is collected in batches and saved in XML format. Relevant fields such as title, abstract, journal, conference, geographic terms, language, and publisher are then extracted and stored in tabular form.
Python libraries used: xml (ElementTree), pandas
Source code: Github
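The extraction step could look roughly like the sketch below. The TRID export schema is not shown in this write-up, so the tag names (`record`, `title`, `abstract`, `language`), the `id` attribute, and the sample XML are hypothetical placeholders, not the real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real TRID export uses its own tag names.
SAMPLE_XML = """
<records>
  <record id="1001">
    <title>Pavement performance in cold regions</title>
    <abstract>This study evaluates asphalt pavement performance.</abstract>
    <language>English</language>
  </record>
  <record id="1002">
    <title>Port freight optimization</title>
    <language>English</language>
  </record>
</records>
""".strip()

FIELDS = ["title", "abstract", "language"]  # assumed field names


def records_to_rows(xml_text, fields=FIELDS):
    """Flatten each <record> element into a dict, one per document."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter("record"):
        row = {"record_id": record.get("id")}
        for field in fields:
            # findtext returns None when the child element is absent
            row[field] = record.findtext(field)
        rows.append(row)
    return rows


rows = records_to_rows(SAMPLE_XML)
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)` to obtain the tabular form; missing fields show up as empty cells.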
Organizing journal categories: Only journals indexed in SCI, SSCI, or EI are kept. The bibliographic records do not state a journal's indexing status, but they do include its ISSN and/or EISSN. Using these numbers, the SCI, SSCI, and EI status of each journal is looked up by sending network requests from Python to 'https://publik.tuwien.ac.at/info/sci_search.php'. The EI journal list, with ISSN numbers, is additionally collected from a separate source. Together, these identify all journals that are SCI, SSCI, or EI.
Python libraries used: urllib, BeautifulSoup
Source code: Github
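Once the indexed ISSN lists are available, the matching itself can be sketched as follows. The ISSN values and helper names are made up for illustration, and the network request is omitted because the query format of the lookup page is not described here.

```python
def normalize_issn(value):
    """Return an ISSN as 'NNNN-NNNC' (uppercase), or None if it is malformed."""
    if not value:
        return None
    compact = str(value).replace("-", "").strip().upper()
    if len(compact) != 8 or not compact[:7].isdigit():
        return None
    if not (compact[7].isdigit() or compact[7] == "X"):
        return None
    return compact[:4] + "-" + compact[4:]


def is_indexed(issn, eissn, indexed_issns):
    """A journal counts as indexed if its ISSN or its EISSN is in the set."""
    candidates = {normalize_issn(issn), normalize_issn(eissn)} - {None}
    return bool(candidates & indexed_issns)


# Hypothetical union of the SCI, SSCI and EI lists (placeholder numbers).
indexed = {normalize_issn(v) for v in ["12345678", "8765-432x"]}

print(is_indexed("1234-5678", None, indexed))   # True
print(is_indexed(None, "8765432X", indexed))    # True
print(is_indexed("0000-0000", "", indexed))     # False
```

Normalizing before comparison avoids missing a match only because one source writes the ISSN without a hyphen or with a lowercase check character.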
Filter out invalid items: Items (documents) are removed if
a. The abstract is missing or shorter than 100 characters.
b. The journal is not SCI, SSCI or EI.
c. The language of the abstract is other than English.
Python libraries used: langdetect, Spacy
Source code: Github
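The three rules can be combined into a single predicate, as in this minimal sketch. The record layout (`abstract`, `issn`, `eissn` keys) and the toy detector are assumptions; in practice a function such as `langdetect.detect`, which returns language codes like "en", could be passed in instead of the stand-in.

```python
MIN_ABSTRACT_CHARS = 100  # threshold from rule (a)


def keep_document(doc, indexed_issns, detect_language):
    """Apply rules (a)-(c); `doc` is a dict with abstract/issn/eissn keys."""
    abstract = (doc.get("abstract") or "").strip()
    if len(abstract) < MIN_ABSTRACT_CHARS:                         # rule (a)
        return False
    if not ({doc.get("issn"), doc.get("eissn")} & indexed_issns):  # rule (b)
        return False
    return detect_language(abstract) == "en"                       # rule (c)


def toy_detector(text):
    """Stand-in for a real language detector; only for this example."""
    return "en" if " the " in f" {text.lower()} " else "unknown"


english = "The model estimates the travel time on urban corridors. " * 2
german = "Das Modell schätzt die Reisezeit auf städtischen Korridoren. " * 2
indexed = {"1234-5678"}  # placeholder ISSN

docs = [
    {"id": "A", "abstract": english, "issn": "1234-5678", "eissn": None},
    {"id": "B", "abstract": "Too short.", "issn": "1234-5678", "eissn": None},
    {"id": "C", "abstract": english, "issn": "0000-0000", "eissn": None},
    {"id": "D", "abstract": german, "issn": "1234-5678", "eissn": None},
]

kept = [d["id"] for d in docs if keep_document(d, indexed, toy_detector)]
print(kept)  # ['A']
```

Running the cheap length check first means the language detector is only called on abstracts that could still be kept.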
Text cleaning: The abstracts are cleaned in the following ways:
a. Removing all words other than nouns, adjectives, verbs, adverbs, and proper nouns.
b. Lemmatizing each word, i.e., reducing it to its base dictionary form by dropping inflectional endings (usually a suffix), e.g., walking to walk.
c. Standardizing the names of countries and U.S. states.
d. Detecting any geographic information in the abstract.
e. Removing words that appear in more than 60% of the abstracts or in fewer than 50 abstracts.
Python libraries used: Spacy, pycountry, scikit-learn, nltk
Source code: Github
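Step (e) is a document-frequency filter. The sketch below implements it in plain Python on a toy corpus, with `min_df` lowered from 50 to 2 so the effect is visible on five documents. scikit-learn's `CountVectorizer` exposes the same two thresholds as `max_df` and `min_df`; whether the original pipeline applied them that way is an assumption.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, max_df=0.6, min_df=50):
    """Drop tokens found in more than `max_df` (fraction) of the documents
    or in fewer than `min_df` (count) documents."""
    n_docs = len(tokenized_docs)
    # Count each token once per document, not once per occurrence.
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    keep = {
        tok for tok, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    }
    return [[tok for tok in doc if tok in keep] for doc in tokenized_docs]


docs = [
    ["study", "traffic", "signal"],
    ["study", "traffic", "pavement"],
    ["study", "asphalt", "pavement"],
    ["study", "asphalt", "crash"],
    ["study", "safety"],
]

cleaned = filter_by_document_frequency(docs, max_df=0.6, min_df=2)
print(cleaned)
# [['traffic'], ['traffic', 'pavement'], ['asphalt', 'pavement'], ['asphalt'], []]
```

Here "study" is dropped for being too common (5 of 5 documents), while "signal", "crash", and "safety" are dropped for being too rare (1 document each).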
Some statistics with cleaned data
The average number of publications each year is around 14,000
This figure shows the number of publications for the top 30 countries. The United States is associated with far more papers than the countries that follow, such as China, Australia, and Canada.
This figure shows the distribution of papers over the top 30 journals.
The journal with the most papers is Transportation Research Record (TRR).
Although TRR has the largest number of papers, TRB, the publisher of TRR, ranks only third among all publishers by number of papers.
Two other publishers, Elsevier and the American Society of Civil Engineers (ASCE), lead the ranking.
Model building
Latent Dirichlet Allocation (LDA): LDA is one of the most popular methods for topic modeling. Its aim is to find the topics a document belongs to, based on the words the document contains. It assumes that documents on similar topics use similar groups of words. LDA therefore represents each document as a probability distribution over latent topics, and each topic as a probability distribution over words.
Number of topics: 50 (the number is chosen based on other transportation bibliographic analyses with a comparable scale in the literature)
Model input: A 146,972 x 2,193 matrix in which each row corresponds to a document and each column to a unique word.
Example input matrix
Model output:
The distribution of words in each topic: For each topic, a weight is obtained for every word. The words with the highest weights characterize the topic.
The distribution of topics in each document: For each document, we get a probability distribution over all topics. The topic with the highest probability is taken as the topic of the document.
LDA model output
Results
Topic-word distribution
The 50 topics are manually inspected and given appropriate names. Some topics do not represent transportation research themes; they consist of words that commonly appear together in academic writing. These topics (27, 34, and 49) are named 'academic words'.
Word cloud for topics 0–24
Word cloud for topics 25–49
Distinct topics vs. overlapping topics
There is no overlap between the two groups of major words for T40 Soil and pile foundation and T47 Asphalt.
Some keywords overlap between T0 Driving simulation and T42 Driving behavior. The sensitivity analyses show that when the number of topics is reduced from 50 to 46, such topics with a substantial overlap are merged.
Research topic proportion among various studies
This figure shows the proportion of topics in all documents.
Topics 27, 34, and 49 are among the most prevalent topics. They consist of generic academic words that tend to appear together, which is why they form their own clusters.
Topic 37 (Transportation sustainability) is found to be a popular research topic.
Research topic distributions for selected countries
(a) shows the topic distributions for China and Australia. Clearly, Australia is higher in T37 Transportation sustainability and T42 Driving behavior, while China is higher in T5 Optimization model and T8 Freight port.
(b) shows three European countries associated with similar topic distributions
(c) demonstrates the strong similarity between Australia and New Zealand
Countries with similar research topics
The figure shows the dendrogram for 30 countries/regions based on the Euclidean distance between topic distributions. Closely connected countries have similar research topics.
A few notable country groups are observed, such as (United States, Canada), (Japan, South Korea), (Norway, Finland), (Australia, New Zealand), and (Switzerland, Netherlands)
We find that countries/regions that are geographically proximate are associated with very similar research topic distributions.
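The distance underlying the dendrogram can be sketched in plain Python. The four-topic distributions below are invented numbers for illustration only, not results from the study.

```python
import math
from itertools import combinations

# Hypothetical topic distributions over 4 topics (each row sums to 1).
topic_dist = {
    "Australia":   [0.40, 0.30, 0.20, 0.10],
    "New Zealand": [0.38, 0.32, 0.18, 0.12],
    "China":       [0.10, 0.20, 0.30, 0.40],
    "Japan":       [0.15, 0.15, 0.35, 0.35],
}


def euclidean(p, q):
    """Euclidean distance between two topic distributions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Distance for every unordered pair of countries.
pairwise = {
    (a, b): euclidean(topic_dist[a], topic_dist[b])
    for a, b in combinations(topic_dist, 2)
}

closest_pair = min(pairwise, key=pairwise.get)
print(closest_pair)  # ('Australia', 'New Zealand')
```

Passing such distances to a hierarchical clustering routine, for example `scipy.cluster.hierarchy.linkage` followed by `dendrogram`, would produce a tree like the one in the figure. The linkage method used in the study is not stated here, so that choice is left open.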
US States with similar research topics
The figure shows the dendrogram for 50 states based on the Euclidean distance between topic distributions. Closely connected states have similar research topics.
Virginia and Texas, despite being geographically remote, are associated with similar topic distributions.
California and New Jersey share quite similar topic distributions.
By contrast, two states bordered by the Gulf of Mexico, namely Florida and Louisiana, have quite different topic distributions.
EV-related research
The blue bar indicates the expected number of studies based on the proportion of EV-related research in all documents. The orange bar shows the actual number of studies found related to EV for the corresponding region.
Germany is much more likely to be a study region than France for EV-related research.
The U.S. is not as likely as China to be a study region.
At the U.S. state level, California, New York, and Vermont stand out as the actual number of EV-related documents far exceeds the expected number for them.
Florida falls short of expectations.
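One way to read the blue bars is: expected count = (documents for the region) x (share of EV-related documents in the whole corpus). The sketch below follows that reading, which is an interpretation of the figure description rather than the confirmed computation, and all counts are invented placeholders.

```python
def expected_vs_actual(region_totals, region_ev_counts, total_docs, total_ev_docs):
    """Return {region: (expected, actual)} assuming every region shares
    the corpus-wide proportion of EV-related documents."""
    ev_share = total_ev_docs / total_docs
    return {
        region: (n_docs * ev_share, region_ev_counts.get(region, 0))
        for region, n_docs in region_totals.items()
    }


# Placeholder counts, not figures from the study.
result = expected_vs_actual(
    region_totals={"Region A": 1000, "Region B": 2000},
    region_ev_counts={"Region A": 35, "Region B": 25},
    total_docs=10_000,
    total_ev_docs=200,
)

for region, (expected, actual) in result.items():
    status = "above" if actual > expected else "at or below"
    print(f"{region}: expected {expected:.0f}, actual {actual} ({status} expectation)")
```

A region whose actual count exceeds its expected count is over-represented in EV-related research relative to its overall volume of studies.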
Conclusion
Summary of the findings
Transportation topics in the United States are studied the most in the literature, followed by China, Australia, Canada, and the United Kingdom.
At the U.S. state level, California, Texas, Florida, New York, and Washington are the top study areas.
Transportation research exhibits clear region-specific characteristics; that is, each geographic region is associated with a unique transportation topic distribution.
Countries/regions that are geographically proximate have quite similar transportation topic distributions in general, while this cannot be observed at the U.S. state level.