Topic modelling with SpaCy, Gensim and Textacy

cheTesta/topic-model

 
 


The Jupyter notebook 'topic-modelling.ipynb' contains the following sections:

  • Initialize. Setting up the environment and loading data.
  • Text extraction. Phrase and token extraction with Gensim and SpaCy.
  • Topic modelling. Using Textacy's LDA model.
  • Data processing. Calculating data for visualization and export.
  • Model evaluation. A collection of visualizations of the resulting topics.
  • Export data. The data can be used to create more visualizations or imported into a graph.

General concept

The emphasis in this notebook is on facilitating an iterative process in which you can easily adjust the stopwords and the number of topics. It also contains features to re-focus on subtopics and thereby create a hierarchy of topics.
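The iterative loop described here can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the helper names `remove_stopwords`, `dominant_topic` and `refocus`, the sample tokens and the topic weights are all hypothetical, and "re-focusing" is assumed to mean keeping the documents whose strongest topic is the one of interest and modelling that subset again.

```python
def remove_stopwords(token_lists, stopwords):
    """Drop stopwords (case-insensitive) from every tokenised document."""
    stop = {word.lower() for word in stopwords}
    return [[tok for tok in tokens if tok.lower() not in stop]
            for tokens in token_lists]


def dominant_topic(weights):
    """Return the index of the largest topic weight for one document."""
    return max(range(len(weights)), key=weights.__getitem__)


def refocus(token_lists, doc_topic_weights, topic_id):
    """Keep documents dominated by topic_id (assumed meaning of 're-focus')."""
    return [tokens
            for tokens, weights in zip(token_lists, doc_topic_weights)
            if dominant_topic(weights) == topic_id]


# Hypothetical tokenised abstracts and a made-up document-topic matrix.
docs = [["the", "drug", "resistance"],
        ["the", "vaccine", "trial"],
        ["drug", "dosage", "study"]]
weights = [[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]

# One iteration: extend the stopword list, then narrow down to topic 0.
cleaned = remove_stopwords(docs, stopwords=["the", "study"])
subset = refocus(cleaned, weights, topic_id=0)
```

In a real run the weights would come from the fitted model, and `subset` would be fed back into the modelling step, possibly with a different number of topics, to obtain the next level of the hierarchy.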

Input

'data-in/tb_data.tsv' contains ~2100 scientific articles, each with the following fields: doi, title, abstract, keywords.
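A file of this shape could be read with the standard library alone. The sketch below assumes that the file has a header row using exactly the names doi, title, abstract and keywords; the function name `read_articles` and the sample row are hypothetical.

```python
import csv
import io

# Column names taken from the description above; a header row is assumed.
FIELDS = ("doi", "title", "abstract", "keywords")


def read_articles(handle):
    """Read tab-separated articles into a list of dicts, one per row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{field: (row.get(field) or "").strip() for field in FIELDS}
            for row in reader]


# Made-up stand-in for the real file, so the sketch runs on its own.
sample = io.StringIO(
    "doi\ttitle\tabstract\tkeywords\n"
    "10.0000/example\tA made-up title\tA made-up abstract.\tlda; topics\n"
)
articles = read_articles(sample)
```

For the real data, the same function would be called on a handle from `open("data-in/tb_data.tsv", newline="", encoding="utf-8")`; the encoding is an assumption.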

Output

Start by looking at the notebook 'topic-modelling.ipynb'. Further down the notebook, the 'visualization' section gives an overview of the modelling data.

Most of the other files in the output data directory (data-out/) are exported to be used as input in other projects. If you are interested in understanding the modelled topics in more detail, look at 'tb_main_doc-top.html' in the output directory, which lists the 15 most relevant articles for each topic.
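A list of the most relevant articles per topic can be derived from a document-topic matrix. The sketch below assumes that "relevance" means the weight a document gives to the topic, which may differ from what the notebook does; `top_documents` and the sample values are hypothetical.

```python
def top_documents(doc_topic_weights, doc_ids, n=15):
    """For each topic, list the ids of the n documents with the highest weight."""
    n_topics = len(doc_topic_weights[0]) if doc_topic_weights else 0
    ranking = {}
    for topic in range(n_topics):
        # Sort document indices by their weight for this topic, highest first.
        order = sorted(range(len(doc_ids)),
                       key=lambda d: doc_topic_weights[d][topic],
                       reverse=True)
        ranking[topic] = [doc_ids[d] for d in order[:n]]
    return ranking


# Made-up weights for three documents and two topics.
example_weights = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
example_ids = ["a", "b", "c"]
ranking = top_documents(example_weights, example_ids, n=2)
```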

Caveat

Topic modelling with LDA is stochastic: it produces (slightly) different results even when run on the same data. The exact same results can therefore not be reproduced.
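The variation comes from random draws made while fitting. The sketch below illustrates this with a toy collapsed Gibbs sampler for LDA, a swapped-in technique chosen because it fits in a few lines; it is not the implementation used by Textacy or by the notebook. The function `gibbs_lda`, its default hyperparameters and the sample documents are hypothetical.

```python
import random


def gibbs_lda(docs, n_topics, n_iter=30, alpha=0.1, beta=0.01, seed=None):
    """Toy collapsed Gibbs sampler; returns per-document topic counts."""
    rng = random.Random(seed)  # seed=None draws fresh randomness every run
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for word in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                probs = [(doc_topic[d][t] + alpha)
                         * (topic_word[t][v] + beta)
                         / (topic_total[t] + n_words * beta)
                         for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=probs)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return doc_topic


toy_docs = [["drug", "dose", "drug"], ["vaccine", "trial", "vaccine"],
            ["drug", "trial", "dose"]]
run_a = gibbs_lda(toy_docs, n_topics=2, seed=1)
run_b = gibbs_lda(toy_docs, n_topics=2, seed=1)
run_c = gibbs_lda(toy_docs, n_topics=2)  # unseeded: may differ between runs
```

In this toy, `run_a` and `run_b` match because they share a seed, while the unseeded `run_c` draws new random numbers on each execution, which is the source of the run-to-run variation. Whether the notebook's model exposes a comparable seed is not stated here.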

Inspiration
