Text-Garden -- Text-Mining Software Tools
(c)
Marko Grobelnik,
Dunja Mladenic
Department of Knowledge Technologies
Jozef Stefan Institute,
Slovenia
The Text-Mining Software Tools enable easy handling of text documents for the
purpose of data analysis, including automatic model generation and document
classification, document clustering, document visualization, handling of Web
documents, crawling the Web and many others. The code is written in C++ and
runs natively on the Windows platform; on Linux/Unix it can be run using Wine
or a similar utility.
The code was developed to meet our own research needs, guided by our research
projects, and refined and polished as time permitted. The top-level components
built on the core of the software have been contributed over time by several
people from our group, including Janez Brank, Blaz Fortuna, Miha Grcar,
Jure Leskovec and Blaz Novak.
Please reference the Web site <www.textmining.net> if you are using any of the
provided utilities.
File formats used for document representation
Three main file formats for document representation are used by the tools.
They cover different ways of handling text documents:
- Compact-Documents format with the file extension ".Cpd"
- Text-Base format with the file extension ".TBs"
- Bag-Of-Words format with the file extension ".Bow"
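As an illustration of what a bag-of-words representation holds, the following
C++ sketch maps each document to a sparse vector of word counts. It does not
reproduce the layout of the ".Bow" file, which is not described here. The names
BowCorpus, tokenize, addDocument and termFrequency are hypothetical, and the
tokenizer (lowercase alphanumeric runs, no stop-word removal or stemming) is an
assumption made for brevity.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical illustration types; NOT Text-Garden's on-disk ".Bow" layout.
using SparseVec = std::map<int, double>;  // word id -> term frequency

struct BowCorpus {
    std::map<std::string, int> wordToId;  // vocabulary shared by all documents
    std::vector<SparseVec> docs;          // one sparse vector per document
};

// Splits text into lowercase alphanumeric tokens (assumed tokenization rule).
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current += static_cast<char>(std::tolower(ch));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Adds one document to the corpus and returns its index.
std::size_t addDocument(BowCorpus& corpus, const std::string& text) {
    SparseVec vec;
    for (const std::string& token : tokenize(text)) {
        // Words seen for the first time get the next free id.
        auto it = corpus.wordToId.find(token);
        if (it == corpus.wordToId.end()) {
            int id = static_cast<int>(corpus.wordToId.size());
            it = corpus.wordToId.emplace(token, id).first;
        }
        vec[it->second] += 1.0;
    }
    corpus.docs.push_back(vec);
    return corpus.docs.size() - 1;
}

// Returns how often `word` occurs in document `doc` (0 if absent).
double termFrequency(const BowCorpus& corpus, std::size_t doc,
                     const std::string& word) {
    assert(doc < corpus.docs.size());
    auto wordIt = corpus.wordToId.find(word);
    if (wordIt == corpus.wordToId.end()) return 0.0;
    auto tfIt = corpus.docs[doc].find(wordIt->second);
    return tfIt == corpus.docs[doc].end() ? 0.0 : tfIt->second;
}
```

For example, after addDocument(corpus, "Text mining, text tools!") the call
termFrequency(corpus, 0, "text") returns 2, because the word occurs twice once
case is ignored.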
Pre-processing of Documents
- Text To Compact-Documents Converter (Txt2Cpd) Download
Transforms various raw text formats, such as Text-Base and some standard datasets (e.g., Reuters),
into a file in Compact-Documents format (".Cpd").
Parameters and example call.
- Text To Bag-Of-Words Converter (Txt2Bow) Download
Transforms various raw text formats, such as Text-Base,
Transactions-File, Compact-Documents-File and some standard datasets (e.g., Reuters),
into a file in Bag-Of-Words format (".Bow").
Parameters and example call.
- Html To Xml Converter (Html2Xml) Download
Transforms HTML documents into cleaned XML documents.
Parameters and example call.
- Html To Text Converter (Html2Txt) Download
Transforms HTML documents into cleaned text documents.
Parameters and example call.
Document Clustering
- Bag-Of-Words K-Means clustering (BowKMeans) Download
Performs K-Means clustering on the Bag-Of-Words format of
documents and outputs the clustering of documents in different formats, such
as a text file or an XML file.
Parameters and example call.
- Bag-Of-Words Hierarchical-K-Means clustering (BowHKMeans)
Download
Performs hierarchical K-Means clustering on the Bag-Of-Words format of
documents and outputs the clustering of documents in different formats, such
as a text file or an XML file.
Parameters and example call.
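To make the clustering step concrete, here is a minimal C++ sketch of the
standard K-Means (Lloyd's) iteration. It is an illustration only and is not
taken from BowKMeans: the dense vectors, the squared Euclidean distance, the
seeding with the first k documents and the names squaredDistance and kMeans are
all assumptions; the actual tool may use a different similarity measure and
initialization.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<double>;  // dense document vector (assumption)

// Squared Euclidean distance between two vectors of equal length.
double squaredDistance(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Lloyd's K-Means. Seeds the centroids with the first k documents (a
// simplification chosen for determinism) and returns, for every document,
// the index of the cluster it ends up in.
std::vector<int> kMeans(const std::vector<Vec>& docs, std::size_t k,
                        int maxIters = 100) {
    assert(k > 0 && k <= docs.size());
    std::vector<Vec> centroids(docs.begin(),
                               docs.begin() + static_cast<std::ptrdiff_t>(k));
    std::vector<int> assignment(docs.size(), -1);

    for (int iter = 0; iter < maxIters; ++iter) {
        bool changed = false;
        // Assignment step: attach each document to its nearest centroid.
        for (std::size_t d = 0; d < docs.size(); ++d) {
            int best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                double dist = squaredDistance(docs[d], centroids[c]);
                if (dist < bestDist) {
                    bestDist = dist;
                    best = static_cast<int>(c);
                }
            }
            if (assignment[d] != best) {
                assignment[d] = best;
                changed = true;
            }
        }
        if (!changed) break;  // no document moved: converged

        // Update step: move each centroid to the mean of its members.
        std::vector<Vec> sums(k, Vec(docs[0].size(), 0.0));
        std::vector<int> counts(k, 0);
        for (std::size_t d = 0; d < docs.size(); ++d) {
            for (std::size_t i = 0; i < docs[d].size(); ++i)
                sums[assignment[d]][i] += docs[d][i];
            ++counts[assignment[d]];
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  // empty cluster keeps old centroid
            for (double& v : sums[c]) v /= counts[c];
            centroids[c] = sums[c];
        }
    }
    return assignment;
}
```

A hierarchical variant can be obtained by applying the same routine
recursively to the documents of each resulting cluster; whether BowHKMeans
works exactly this way is not stated here.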
Learning Models for Classification of Documents
- Binary-class Support-Vector-Machine training algorithm (BowTrainBinSVM)
Download
Learns a model via training a binary-class Support Vector Machine on the set
of input documents provided in the Bag-Of-Words format.
Parameters and example call.
- One-Class Support-Vector-Machine training algorithm (BowTrainOneClassSVM)
Download
Learns a model via training a one-class Support Vector Machine on the set
of input documents provided in the Bag-Of-Words format.
Parameters and example call.
- Logistic Regression training algorithm (BowTrainLogReg)
Download
Learns a model using logistic regression on the set
of input documents provided in the Bag-Of-Words format.
Parameters and example call.
- Regression Support-Vector-Machine training algorithm (BowTrainRegSVM)
Download
Learns a model via training a regression Support Vector Machine on the set
of input documents provided in the Bag-Of-Words format.
Parameters and example call.
- Winnow training algorithm (BowTrainWinnow)
Download
Learns a model via training a Winnow on the set
of input documents provided in the Bag-Of-Words format.
Parameters and example call.
- Perceptron training algorithm (BowTrainPerceptron)
Download
Learns a model via training a Perceptron on the set
of input documents provided in the Bag-Of-Words format.
Parameters and example call.
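As a rough picture of what training a linear model on bag-of-words input
involves, the following C++ sketch implements the classic mistake-driven
perceptron over sparse vectors. It is not the code of BowTrainPerceptron: the
types SparseVec and LinearModel, the functions score and trainPerceptron, the
+1/-1 label encoding and the default parameters are assumptions made for this
example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using SparseVec = std::map<int, double>;  // word id -> weight (assumption)

// Hypothetical model type: one weight per word id plus a bias term.
struct LinearModel {
    std::map<int, double> weights;
    double bias = 0.0;
};

// Raw decision value w.x + b; its sign is the predicted class.
double score(const LinearModel& model, const SparseVec& doc) {
    double s = model.bias;
    for (const auto& entry : doc) {
        auto it = model.weights.find(entry.first);
        if (it != model.weights.end()) s += it->second * entry.second;
    }
    return s;
}

// Classic perceptron training; every label must be +1 or -1.
LinearModel trainPerceptron(const std::vector<SparseVec>& docs,
                            const std::vector<int>& labels,
                            int epochs = 10, double learningRate = 1.0) {
    assert(docs.size() == labels.size());
    LinearModel model;
    for (int epoch = 0; epoch < epochs; ++epoch) {
        bool mistakes = false;
        for (std::size_t i = 0; i < docs.size(); ++i) {
            assert(labels[i] == 1 || labels[i] == -1);
            if (labels[i] * score(model, docs[i]) > 0.0) continue;  // correct
            // Misclassified: nudge the weights toward the true label.
            for (const auto& entry : docs[i])
                model.weights[entry.first] +=
                    learningRate * labels[i] * entry.second;
            model.bias += learningRate * labels[i];
            mistakes = true;
        }
        if (!mistakes) break;  // all training documents classified correctly
    }
    return model;
}
```

Winnow follows the same mistake-driven loop but updates the weights
multiplicatively rather than additively.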
Using unlabeled data

Classification of Documents
- Generic classifier using models created with one of the training algorithms (BowClassify)
Download
Classifies input documents provided in the Bag-Of-Words format using the provided model.
Parameters and example call.
- Classification using nearest-neighbour algorithm (Bow2NNbrs)
Download
Classifies input documents provided in the Bag-Of-Words format using the nearest-neighbour algorithm.
Parameters and example call.
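The nearest-neighbour idea can be sketched in a few lines of C++: a query
document receives the label of the most similar labelled document. The sketch
below is illustrative and is not the implementation of Bow2NNbrs; the use of
cosine similarity, a single neighbour, integer labels and the names dot, cosine
and classifyNearest are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

using SparseVec = std::map<int, double>;  // word id -> weight (assumption)

// Dot product of two sparse vectors.
double dot(const SparseVec& a, const SparseVec& b) {
    double sum = 0.0;
    for (const auto& entry : a) {
        auto it = b.find(entry.first);
        if (it != b.end()) sum += entry.second * it->second;
    }
    return sum;
}

// Cosine similarity; defined here as 0 when either vector has zero length.
double cosine(const SparseVec& a, const SparseVec& b) {
    double na = std::sqrt(dot(a, a));
    double nb = std::sqrt(dot(b, b));
    return (na == 0.0 || nb == 0.0) ? 0.0 : dot(a, b) / (na * nb);
}

// Returns the label of the training document most similar to `query`
// (1-nearest-neighbour; ties go to the earliest training document).
int classifyNearest(const std::vector<SparseVec>& train,
                    const std::vector<int>& labels,
                    const SparseVec& query) {
    assert(!train.empty() && train.size() == labels.size());
    std::size_t best = 0;
    double bestSim = cosine(train[0], query);
    for (std::size_t i = 1; i < train.size(); ++i) {
        double sim = cosine(train[i], query);
        if (sim > bestSim) {
            bestSim = sim;
            best = i;
        }
    }
    return labels[best];
}
```

Extending this to k neighbours would mean keeping the k highest similarities
and taking a majority vote over their labels.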
Feature construction/extraction
- Learning semantic-space of documents (Bow2SemSpace)
Download
Creates a semantic-space representation of documents based on Latent Semantic Indexing.
Parameters and example call.
- Projecting on the semantic-space of documents (ProjBow2SemSpace)
Download
Projects documents provided in the Bag-Of-Words format onto the semantic-space
representation of documents that was previously generated using Bow2SemSpace.
Parameters and example call.
- Learning language independent semantic-space of documents (PrSet2SemSpace)
Download
The utility learns a language-independent semantic space for two languages from a paired corpus.
Parameters and example call.
- Feature Extraction from Images (ImgFeatures)
Download
The utility extracts various groups of features from input images.
Parameters and example call.
Visualization of documents based on clustering
- Creating graph representation of documents (Bow2VizGraph)
Download
Creates a graph representation of input documents provided in the Bag-Of-Words format
and outputs an .xml file.
Parameters and example call.
- Creating tiling representation of documents (Bow2VizTile)
Download
Creates a tiling representation of input documents provided in the Bag-Of-Words format
and outputs an .xml file.
Parameters and example call.
- Visualization of documents represented with graph representation (BowGraphViz)
Download
Provides visualization of documents as a graph using a graphical interface.
Parameters and example call.
- Visualization of documents represented with tiling representation (BowTileViz)
Download
Provides visualization of documents with a tiling representation using
a graphical interface.
Parameters and example call.
- Graph visualization (GraphDraw) Download
Draws a graph on the computer screen.
Parameters and example call.
Visualization of documents based on semantic-space
Simple Web Mining
Crawling
Search engine