Drexel University
Drexel University
Due to its unique structural features and rich usergenerated
content, Wikipedia is being increasingly
recognized as a useful knowledge resource that can be
exploited for various applications. Nevertheless, the
mode of information search and retrieval on Wikipedia
remains that of conventional keyword-based search and
retrieval of a list of articles ranked in terms of keyword
matching. The objective of the ongoing project reported
in this paper is to create a Web-based knowledge portal
using the data extracted from Wikipedia, for the sake of
enabling semantics-based search and exploration. The
methodology is currently applied to the philosophy domain.
Hence the project name: WikiPhiloSofia. In this
paper we present extraction and visualization of the facts,
relations, and networks involving 300 major philosophers
as obtained from Wikipedia and we partially compare
the results against those obtained by using Thomson
Reuters’ Arts & Humanities Citation Index data. Insofar
as the work aims at enabling semantics-based search and
exploration by exploiting the user-generated content, it
embodies the movement toward Web 3.0, i.e., the convergence
of the Social Web (Web 2.0) and the Semantic
Web. Insofar as it emphasizes information visualization
for the sake of enhancing the user’s information seeking
experience at the aesthetic level as well as at the cognitive
level, however, the work also embodies the trend
toward information aesthetics. As such, the WikiPhiloSofia
project serves as a venue of the convergence
of arts, humanities, and computer/information science/
technology, contributing to a paradigm shift toward the
next generation of online search, retrieval, and delivery
of digital information in the humanities.
Introduction
Philosophy (and here we mean Western philosophy),
dating back at least to approximately 600 BC, is one of
the oldest of all academic disciplines and is, in particular, one of the core disciplines in the humanities. Partly
due to its long history, and partly due to the nature of
the discipline itself, the domain of philosophy presents
a rich semantic network involving an extended genealogy
of philosophers and related philosophical concepts,
ideas, and doctrines, which can be explored and examined
from diverse perspectives.
Wikipedia (http://www.wikipedia.org) is an open-access,
collaborative Web encyclopedia project, initiated by
Jimmy Wales and Larry Sanger in 2001. Wikipedia has
since grown rapidly to become one of the most soughtafter
resources on the Web. Due to its collaborative way
of construction and due to its impressive size and growth
rate, Wikipedia is considered a foremost example of Web
2.0 applications.
The objective of the ongoing project reported here, entitled
WikiPhiloSofia (formerly known as The WikiPhil
Portal), is to extract, analyze, and visualize meaningful
and interesting facts, relations, and connections among
philosophers and philosophical concepts via the automatic
processing of the structural features and semantic
content of Wikipedia. By doing so, we aim at creating a
useful and user-friendly portal for students of philosophy
as well as the general public, thereby contributing to the
cause of digital humanities. The project is still in its early
stage. However, the current paper extends our previous
work on the project (Athenikos and Lin, 2008), especially
by presenting new results and findings. Specifically,
the paper illustrates extracting and visualizing the data
concerning facts, relations, and networks involving 300
philosophers, as extracted from Wikipedia, for semantics-
based search and exploration, and it presents a partial
comparison of the Wikipedia data against those from
Thomson Reuters’ Arts & Humanities Citation Index
(http://thomsonreuters.com/products_services/scientific/
Arts_Humanities_Citation_Index).
Wikipedia Data Extraction
A prototype system for the project was implemented in
Java using the Java servlet technology. First we describe
the materials, methods, and results involving data extraction.
Materials
We only used the English version of Wikipedia. The initial
results we had obtained in the project (Athenikos and
Lin 2008) were based on the data extracted from Wikipedia
pages downloaded on 29 May 2008. The results
we report in this paper are based on the Wikipedia pages
downloaded more recently on 23 December 2008.
Methods
We obtained a chronological list of 300 philosophers (including
influential theologians, writers, scientists, etc.)
from Wikipedia’s “Timeline of Western Philosophers”
page. We extracted information on the hyperlink connections
and academic/biographical facts concerning the
philosophers (as presented in infoboxes and wikitables)
from their individual Wikipedia article pages, and stored
the data in the form of semantic triples (Subject–Predicate–
Object) in a MySQL database. We retrieved the
data needed for visualization by querying the database,
and stored the results as XML files marked up with
GraphML and TreeML.
Results
Table 1 shows the types of information extracted. Table
2 summarizes the basic statistics concerning the dataset.
Table 1. Types of information extracted.
Table 2. Basic statistics on the Wikipedia dataset.
As shown in Table 2, while there exists a high number of
hyperlink connections among the 300 philosopher pages,
there are a few pages that do not contain any out-links to
the other philosopher pages and/or do not receive any inlinks.
Also, only 192 philosopher pages contain infoboxes
that summarize academic/biographical facts, which is
(at least in part) why there are some philosophers that
are shown to have no influenced/influenced-by relations.
Semantic Search Interface
We created a Web portal interface via which the user can
issue queries on the facts, relations, and networks involving
300 philosophers and explore the results displayed using diverse modalities of interactive information visualization
as will be illustrated in the next section. Figure
1 shows the homepage of the WikiPhiloSofia portal
(http://research.cis.drexel.edu:8080/sofia/WPS/). Upon entering the portal, the user will choose to focus
on one philosopher, two philosophers, or all 300 philosophers.
In case a user chooses to focus on one philosopher,
for example, the user is directed to the menu shown
in Figure 2, which contains three fields corresponding to
Subject, Predicate, and Object. The user can select a philosopher
from the first dropdown menu, and then select
a predicate from the second dropdown menu, in order
to retrieve relevant objects. Table 3 summarizes various
query options and result display modalities. Interactive Visualization
The query results are presented via interactive visualization,
implemented by using the Prefuse information visualization
toolkit (http://prefuse.org).
Figure 3 presents a radial graph representing the facts
about Plato, which amounts to visualizing the semantic
network involving the philosopher. Figure 4 shows a
fisheye tree representing extended influences originating
from Plato. The visualization of non-overlapping extended link/influence
relations from/to one philosopher (using a graph
or radial graph), and of strongest link/influence connections
among 300 philosophers, is implemented by using
a novel graph simplification method that we have developed,
called the Strongest Link Paths (SLP) (Athenikos
and Lin, 2008), which substantially simplifies the graph
topology while highlighting the most dominant nodes
and their interconnections. Figure 6 presents a radial graph, simplified via SLP,
which represents extended influences originating from
Thales. This amounts to visualizing the small-world
network (Milgram, 1967) of influence involving Thales.
The figure shows that Thales, the first philosopher on the
chronological list of 300 philosophers, can reach Foucault,
the last one, within 3 degrees of separation (via
Anaximander and Heidegger).
The graphs that result from applying SLP to the hyperlink/
influence connections consist of distinct clusters
separated from one another. Figure 7 shows a close-up of
the largest cluster in the strongest out-link network that
centers on Plato and Aristotle. Figure 8 shows the largest
cluster in the strongest influenced-relation network, centering on Kant. Comparison with AHCI Data
In this section we discuss some of the results of comparing
the Wikipedia dataset against a subset of Thomson
Reuters’ Arts & Humanities Citation Index (AHCI) that
contains 1.26 million records covering the 10-year period
of 1988-1997.
Table 4 lists top 20 philosophers that receive the greatest
number of in-links from among 300 philosophers in the
Wikipedia dataset, those that receive in-links from the
greatest number of philosophers in the Wikipedia dataset,
and those (among the 300 philosophers) that have
the highest citation count in the AHCI dataset.
Interestingly, Aristotle shows up on top for all 3 categories.
While there are certain differences among the 3
lists, the majority of the philosopher names that appear
on the 2 lists involving the Wikipedia dataset include
major figures in the philosophy domain, as does the list
obtained from the AHCI dataset. This shows that the hyperlink data extracted from Wikipedia, which embodies
a huge amount of latent human annotation (Chakrabarti
et al., 1999), provide a fairly good representation of the
central figures in philosophy. In order to prevent a naïve,
simplistic, and literal interpretation of these findings, it
must be mentioned that what we argue is not that these
philosophers are central figures because they have a
large number of hyperlink connections in Wikipedia or
that their relative centrality corresponds to link counts. Table 5 shows a comparison of the list of philosophers
who have bi-directional link connections with Heidegger
in the Wikipedia dataset and the list of philosophers (considering
only those that belong to the 300 philosopher
set) that are most frequently co-cited with Heidegger
within the AHCI dataset. As shown, 13 out of 20 most
co-cited philosophers appear on the list of bi-linked philosophers.
In addition, 3 out of the remaining 7 co-cited
philosophers (Descartes, Wittgenstein, and Augustine)
have hyperlink connections with Heidegger in one direction.
Again, we are not making a naïve argument that
those who are bi-linked or co-cited with Heidegger are
intellectually closer or even similar to him in the order
of bi-link/co-citation counts. It is however interesting
to note the overlap between the two lists. In most cases
shown in the table it is not hard to imagine why a certain
philosopher may be bi-linked and/or co-cited with Heidegger,
even though the reasons vary among the cases. Figure 9 shows a radial graph representing 25 philosophers
most frequently co-cited with Heidegger and with
Dewey, respectively. Those co-cited with Dewey include
Heidegger. Those co-cited with both of them include
Wittgenstein. Related Work and Discussion
Wikipedia has recently become a topic of intense interest
among researchers who recognize its utility as a source
of a vast amount of knowledge that can be exploited for
various applications. What renders Wikipedia a particularly
valuable resource is the fact that it can be mined for
knowledge based on its structural features as well as its
textual content (Zesch et al., 2007). As such, some Semantic
Web (Berners-Lee et al., 2001) researchers have
turned to Wikipedia for clues to mitigating the knowledge
acquisition bottleneck (Krötzsch et al., 2005). In this paper we have demonstrated extracting/visualizing
structured data available in Wikipedia by exploiting its
hyperlinks (cf. Bellomi and Bonato, 2005), category
links (cf. Chernov et al., 2006), and templates (i.e., infoboxes
and wikitables) (cf. Auer and Lehmann, 2007).
The logical extension of the current approach would
be to extend the methodology to derive more extensive
and implicit relations and connections, by exploiting the
textual content of Wikipedia articles and by employing
inference.
Social network analysis (SNA) has been used for some
time in diverse disciplines besides sociology. With the
advent of Web 2.0 (O’Reilly, 2005), characterized by
the emergence of various collaborative authoring, blogging,
bookmarking, tagging, networking, etc. sites that
utilize combined social capital, SNA has become a key
technique for capturing and exploiting data on social
connections and interactions for various applications.
Even though we have not attempted (and do not intend)
to compute various centrality measures used in SNA
(Wasserman and Faust, 1994), we have shown that the
lists of highly-connected philosophers obtained by using
even the simple hyperlink data in general provide a
good representation of the central figures in the philosophy
domain (not in the naïve sense that rankings purely
based on hyperlink statistics correspond to the relative
importance of each philosopher one-to-one). We have
also shown that the networks of philosophers emerging
from the hyperlink and semantic data extracted from
Wikipedia exhibit the characteristic of the small world
(Milgram, 1967) or the six degrees of separation phenomenon.
Insofar as the networks that we consider in this project
are concerned with the connections among philosophers
(and philosophical concepts), indirectly derived from
Wikipedia, and not with the direct connections and interactions
among the editors of the corresponding Wikipedia
articles, the project is related to citation analysis, in particular, author co-citation analysis (ACA) (White,
2003). In this regard, we have presented a partial comparison
of the results based on the Wikipedia data against
those obtained by applying ACA to Thomson Reuters’
AHCI data, using specific examples. While the results
have shown an overall correspondence between the two
datasets, it must be pointed out that the comparison was
limited to 300 philosophers considered. It must again be
emphasized that we do not equate link count or co-citation
count with intellectual closeness or similarity.
Lastly, the WikiPhiloSofia project is prominently a project
about visualization as an effective mode of data/
information/knowledge representation. Information visualization,
via the use of interactive, visual representations
of abstract data, serves to amplify human cognition,
making it possible or easier to recognize the hidden
patterns and structures that might not otherwise be quite
apparent or comprehensible (Card et al., 1997; Tufte,
1990), while enhancing the user’s information search experience
at the aesthetic level as well. We have illustrated
how the results of various user queries can be presented
using diverse modalities of visualization, by effectively
visualizing the facts, relations, and networks that pertain
to the 300 philosophers in the Wikipedia dataset. In particular,
we have applied the Strongest Link Paths (SLP)
method (Athenikos and Lin, 2008), which selects only
the strongest link connections (measured in terms of hyperlink
count or other connection strength measure) in
order to highlight the most significant nodes and links.
Even though SLP is rather simpler than other graph scaling
methods such as pathfinder network (Schvaneveldt,
Durso, and Dearholt, 1989) or main path analysis (Hummon
and Doreian, 1989), we have found that it allows
us to achieve substantial data reduction and to obtain a
meaningful representation of the dominant figures and
their connections within the network of philosophers
even from the simple hyperlink data.
Conclusion
The WikiPhiloSofia project aims at creating a knowledge
portal based on the data extracted from Wikipedia. In
this paper we have illustrated extracting and visualizing
the facts, relations, and networks involving 300 major
philosophers in order to enable semantics-based search
and exploration. The future work will include extending
the approach to include more philosophers, extracting
and visualizing the connections among philosophical
concepts, deriving more extensive and implicit relations
and connections, as well as applying the methodology
to domains other than philosophy for comparison and
evaluation purposes.
References
Athenikos, S.J. and Lin, X. (2008). The WikiPhil Portal:
Visualizing Meaningful Philosophical Connections,
Presented at 2008 Chicago Colloquium on Digital Humanities
and Computer Science (DHCS 2008), Chicago,
IL, November 2008. Forthcoming in Proceedings of the
Chicago Colloquia on Digital Humanities and Computer
Science.
Auer, S. and Lehmann, J. (2007). What Have Innsbruck
and Leipzig in Common? Extracting Semantics from
Wiki Content, Proceedings of 4th European Semantic
Web Conference (ESWC 2007), Innsbruck, Austria, June
2007. Bellomi, F. and Bonato, R. (2005). Network Analysis
for Wikipedia, Proceedings of the First International
Wikimedia Conference (Wikimania 2005), Frankfurt am
Main, Germany, August 2005.
Berners-Lee, T., Handler, J., and Ossila, O. (2001). The
Semantic Web, Scientific American, 284: 34-43.
Card, S. K., Mackinlay, J. D., and Shneiderman, B.
(eds.). (1997). Readings in Information Visualization:
Using Vision to Think. Morgan Kaufman Publishers, San
Francisco, CA.
Chakrabarti, S., Dom, B. E., Kumar, S. R., Raghavan, P.,
Rajagopalan, S., Tomkins, A. D., Gibson, D., and Kleinberg,
J. (1999). Mining the Web’s Link Structure, Computer,
32(8): 60-67.
Chernov, S., Iofciu, T., Nejdl, W., and Zhou, X. (2006).
Extracting Semantic Relationships between Wikipedia
Categories, Proceedings of the First Workshop on Semantics
Wikis (SemWiki 2006) at the Third European
Semantic Web Conference (ESWC 2006), Budva, Montenegro,
June 2006.
Hummon, N. P. and Doreian, P. (1989). Connectivity in
a Citation Network: The Development of DNA Theory,
Social Networks, 11: 39-63.
Krötzsch, M., Vrandečić, D., and Völkel, M. (2005).
Wikipedia and the Semantic Web – the Missing Links,
Proceedings of the First Wikimedia Conference (Wikimania
2005), Frankfurt am Main, Germany, August
2005.
Milgram, S. (1967). The Small World Problem, Psychology
Today, 1(1): 60–67.
O’Reilly, T. (2005). What Is Web 2.0: Design Patterns
and Business Models for the Next Generation of Software.
http://www.oreillynet.com/pub/a/oreilly/tim/
news/2005/09/30/what-is-web-20.html (Last accessed
12 November 2008).
Schvaneveldt, R.W., Durso, F.T., and Dearholt, D.W.
(1989). Network Structures in Proximity Data, The Psychology
of Learning and Motivation: Advances in Research
and Theory, vol. 24, G. Bower (ed.), 249-284.
Academic Press, New York.
Tufte, E. R. (1990). Envisioning Information. Graphics
Press, Cheshire, CT.
Wasserman, S. and Faust, K. (1994). Social Network
Analysis. Methods and Applications. Cambridge University
Press, Cambridge, UK.
White, H. D. (2003). Pathfinder Networks and author
Cocitation Analysis: A Remapping of Paradigmatic Information
Scientists, Journal of the American Society for
Information Science and Technology, 54(5): 423-434.
Zesch, T., Gurevych, I., and Mühlhäuser, M. (2007). Analyzing
and Accessing Wikipedia as a Lexical Semantic
Resource, Proceedings of the Biannual Conference of
the Society for Computational Linguistics and Language
Technology, Tübingen, Germany, April 2007.
If this content appears in violation of your intellectual property rights, or you see errors or omissions, please reach out to Scott B. Weingart to discuss removing or amending the materials.
Complete
Hosted at University of Maryland, College Park
College Park, Maryland, United States
June 20, 2009 - June 25, 2009
176 works by 303 authors indexed
Conference website: http://web.archive.org/web/20130307234434/http://mith.umd.edu/dh09/
Series: ADHO (4)
Organizers: ADHO