[Final Report] Knowledge Garden
Knowledge Garden
Contact
李德珺 keke18_0809@163.com
Abstract
The objective of our project is to build an encyclopedia that helps people, especially industrial engineers (IEers), grasp the structure of knowledge in Industrial Engineering and see the connections among its branches. We collect all data from DBpedia and organize it in a structured, efficient form. For each search, the user receives a tree structure of related information that he or she searched for or may be interested in. Our project may become an essential part of building a knowledge city and may influence how people think and learn in the future.
Keywords
encyclopedia, DBpedia, ontology, entry, category
1. Motivation
How can one be an efficient and effective learner? This is the core question that every industrial engineer in Knowledge City thinks about all the time. Faced with such a complicated web of knowledge, we easily lose ourselves in it. According to the theory of information input and processing in human factors, the more knowledge is structured and coded, the easier it is to remember. With this in mind, we try to help learners of Industrial Engineering make full use of information on the Internet and further explore the relationships among knowledge elements. Learners can thereby develop a more systematic understanding of knowledge, use Knowledge City smartly, and help develop it.
2. Goal
Since Knowledge Garden is a tool that helps learners develop a systematic knowledge structure, two main problems are involved:
(1) Where do Knowledge Gardeners obtain the data?
(2) How do Knowledge Gardeners store and utilize the data?
In this semester’s database course we became acquainted with several network databases, such as DBpedia. Because DBpedia publishes structured data extracted from Wikipedia, we regard it as an authoritative source. We therefore obtain our data from DBpedia after thoroughly analyzing its structure and philosophy.
In short, following the network-database approach, we build our client on a PC and connect to the Internet for data.
3. Database Design
3.1 Data Structure
Before designing our database, we need to explore the nature of our data and get a data structure.
As we all know, entries in the real world are related to each other, and their relationships are too complex to present clearly to users. Fortunately, entries can be classified into categories, each of which collects the entries in the same field, and the sub/broader relations among entries can be transferred directly onto the categories. With this structure, our data can be well organized and clearly presented to users. The resulting data structure is a directed graph.
3.2 E-R Model and Database
Given the hierarchical structure of categories and entries in our project, the database we need follows quite directly. The E-R model is shown below:
We can see that there are two entity sets: Category and Entry. Every entity has an ID and a name, and every Entry additionally has content. A category contains many entries, and an entry may be included in several categories. Categories are connected to one another according to the natural hierarchy of the concepts they represent.
Based on the E-R model, the structure of our database is shown below:
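As a rough textual counterpart to the database structure, the following is a minimal sketch of one relational schema that is consistent with the E-R model described above, written with Python's built-in sqlite3 module. The table and column names (entry, category, entry_class, category_link) are assumptions made for illustration and are not taken from the project's actual database.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE entry (
    entry_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    content  TEXT
);
CREATE TABLE category (
    category_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
-- Many-to-many relation: an entry may belong to several categories.
CREATE TABLE entry_class (
    entry_id    INTEGER NOT NULL REFERENCES entry(entry_id),
    category_id INTEGER NOT NULL REFERENCES category(category_id),
    PRIMARY KEY (entry_id, category_id)
);
-- Directed links between categories (sub-category -> broader category).
CREATE TABLE category_link (
    sub_id     INTEGER NOT NULL REFERENCES category(category_id),
    broader_id INTEGER NOT NULL REFERENCES category(category_id),
    PRIMARY KEY (sub_id, broader_id)
);
"""


def create_database(path=":memory:"):
    """Create an empty database with the sketched schema and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_database()
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

The two relation tables carry the edges of the directed graph: entry_class links entries to categories, and category_link links a category to a broader one.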
4. Resource
4.1 Where do we obtain the data?
The most important breakthrough of Knowledge Garden is its use of DBpedia, from which we obtain all of our data.
As mentioned before, Knowledge Garden, as a professional encyclopedia of Industrial Engineering, focuses on structurally organizing the items of the whole knowledge system. Entries and categories have to be linked with their related items, so what we need is linked data that is structurally organized.
With this intent, we turned our attention to DBpedia, a network database that extracts structured information from Wikipedia and makes this information available on the Web. The DBpedia Ontology is a shallow, cross-domain ontology that has been manually created based on the most commonly used infoboxes within Wikipedia. The ontology currently covers over 205 classes, which form a subsumption hierarchy, and has 1,210 properties. Because so much information is embedded in DBpedia, we first analyze the structure of its database.
4.2 Structure of Database in DBpedia
The structure of DBpedia grows out of the semantics embedded in Wikipedia, and these semantics in turn define and normalize the structure of DBpedia.
When a user searches for an entry, he or she types in its name or a synonym, which maps to the synset (the set of synonyms for one concept) and returns the related content. This is called the synonym feature.
DBpedia also includes other features. When a synonym of an entry is typed in, DBpedia refers to the infoboxes for the type of data, as well as to the definition of the entry, which is always the first sentence of the Wikipedia article. In addition, the relation between the entry and its categories is represented. An entry may belong to several categories. For example, the entry “Industrial Engineering” belongs to categories such as “Manufacturing” and “Operations Research”. Sub-categories and super-categories related to the obtained categories are also included in the search results. Together, infoboxes, definition, and category make up the type feature.
Moreover, the relation between one entry and another is also included. This kind of relation is implicit and only expresses contextual information on the web page. This is regarded as the relation feature.
We Knowledge Gardeners build our database structure on these three features: synonym, type, and relation. This semantic and ontological structure also corresponds to the design of our database in the previous section.
4.3 How to utilize the data?
After identifying the features of DBpedia, our next step is to grab the data from the web. In DBpedia, the ontology relating entries and categories is represented as RDF triples. At the time of writing, DBpedia contains almost 103 million RDF triples.
RDF (Resource Description Framework) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model. It has come to be used as a general method for the conceptual description or modeling of information that is implemented in web resources. Each triple states one fact in the form subject, predicate, object. As a flexible data model for representing extracted information and publishing it on the web, RDF triples are the key for us to grab the data from DBpedia.
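To make the role of triples concrete, here is a minimal sketch that reads simplified N-Triples lines (subject, predicate, and object all written as URIs in angle brackets) and separates entry-to-category links from category-to-broader-category links. It assumes that the category of an entry is given by the dcterms:subject predicate and the category hierarchy by skos:broader; which predicates the project's extractor actually followed is not stated, and the sample lines are illustrative rather than extracted data.

```python
import re

# Assumed predicates (not confirmed by the report).
SUBJECT = "http://purl.org/dc/terms/subject"
BROADER = "http://www.w3.org/2004/02/skos/core#broader"

# Matches a triple whose three terms are all URIs: <s> <p> <o> .
TRIPLE_RE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def parse_triple(line):
    """Return (subject, predicate, object), or None if the line is not a URI-only triple."""
    match = TRIPLE_RE.match(line.strip())
    return match.groups() if match else None


def split_links(lines):
    """Collect (entry, category) and (category, broader category) pairs from triples."""
    entry_to_category, category_to_broader = [], []
    for line in lines:
        triple = parse_triple(line)
        if triple is None:
            continue  # skip comments, literals, and malformed lines
        subject, predicate, obj = triple
        if predicate == SUBJECT:
            entry_to_category.append((subject, obj))
        elif predicate == BROADER:
            category_to_broader.append((subject, obj))
    return entry_to_category, category_to_broader


# Illustrative sample lines.
sample = [
    "<http://dbpedia.org/resource/Industrial_engineering> "
    "<http://purl.org/dc/terms/subject> "
    "<http://dbpedia.org/resource/Category:Industrial_engineering> .",
    "<http://dbpedia.org/resource/Category:Industrial_engineering> "
    "<http://www.w3.org/2004/02/skos/core#broader> "
    "<http://dbpedia.org/resource/Category:Engineering_disciplines> .",
    "# a comment line that is skipped",
]
entry_links, category_links = split_links(sample)
```

The two resulting lists correspond to the two kinds of relation in the E-R model: entry belongs to category, and category is a sub-category of another category.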
We developed a Java program that traverses the DBpedia database and extracts the information we need to the PC.
By the middle of the semester, we had extracted over 800 entries and 30 categories to build the database, which covered 6 semantic hierarchies in Industrial Engineering. Since then, we have extended the data from DBpedia to 13,699 entries and 748 categories, covering 9 hierarchies of articles related to Industrial Engineering. The size of the database has grown to over 6 MB.
5. Web Design
5.1 Link Structure between the Web Pages
The link structure between the web pages is shown in the following figure. Of the four web pages in Knowledge Garden, three are search pages: the home page, the result page, which shows the search results, and the fuzzy query page, which lists the candidate entries the user may have meant. The user can type a search term into any of these three pages, and clicking the search button starts the query process. The system first looks for an entry whose name is the same as the search term. If exactly one entry has a name identical to the search term, the system jumps to the result page. If more than one entry has a name similar to the search term, it jumps to the fuzzy query page. Clicking one of the similar entries listed on the fuzzy query page then leads to the result page for that entry.
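The page-routing decision described above can be sketched as a small function. This is an illustration under stated assumptions, not the project's PHP code: name comparison is assumed to be case-insensitive, "similar" is approximated by simple substring containment, and a search with no match at all is assumed to land on the fuzzy query page with an empty list, a case the description does not cover.

```python
def route_query(query, entry_names):
    """Decide which page to show for a search term.

    Returns ("result", [name]) when exactly one entry name equals the query,
    otherwise ("fuzzy", similar_names).
    """
    needle = query.strip().lower()
    exact = [name for name in entry_names if name.lower() == needle]
    if len(exact) == 1:
        return ("result", exact)
    # Assumption: "similar" means the entry name contains the search term.
    similar = [name for name in entry_names if needle and needle in name.lower()]
    return ("fuzzy", similar)


# Hypothetical entry names used only for demonstration.
names = ["Lean manufacturing", "Manufacturing", "Operations research"]
exact_case = route_query("Manufacturing", names)
fuzzy_case = route_query("manu", names)
```

Here exact_case routes to the result page for "Manufacturing", while fuzzy_case routes to the fuzzy query page with both manufacturing-related names as candidates.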
5.2 Fuzzy Query
5.2.1 Definition
At first, we defined a fuzzy query as a query that tolerates spelling errors, but after some study we found that spelling-error tolerance requires a large database containing all the possible spelling errors. In the end, a fuzzy query in Knowledge Garden returns the entries in the database whose names share one or more parts with the search term.
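A minimal sketch of this kind of matching follows. It assumes that a "part" is a whitespace-separated word of the search term and that candidates sharing more parts should be listed first; both choices are assumptions, since the definition above does not fix them.

```python
def fuzzy_candidates(query, entry_names):
    """Return entry names that share at least one word with the query.

    Candidates are ordered by the number of shared words (most first),
    then alphabetically.
    """
    parts = query.lower().split()
    scored = []
    for name in entry_names:
        lowered = name.lower()
        shared = sum(1 for part in parts if part in lowered)
        if shared > 0:
            scored.append((shared, name))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]


# Hypothetical entry names; the query contains a misspelled second word.
names = ["Industrial engineering", "Industrial design",
         "Systems engineering", "Ergonomics"]
candidates = fuzzy_candidates("industrial enginering", names)
```

Even though "enginering" is misspelled and matches nothing, the correctly typed part "industrial" still brings up the intended entry among the candidates, without any table of spelling variants.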
5.2.2 Advantage
a) Simplify the database
With this kind of fuzzy query we do not need to build a large spelling-error tolerance database, so the size of the database is reduced.
b) Improve the efficiency
The user can find the desired entry even when the search term is not exactly the same as the entry name in the database, which makes searching more efficient.
5.3 Data Selection
There are three kinds of search processes in the data selection among the four tables in the database, as listed below.
1) Find the entry content given the entry name, as follows.
a) Select the entry content whose entry name is the same as what the user wants to search for from the entry list.
2) Find the category names given the entry name, as follows.
a) Select the entry id whose entry name is the same as what the user wants to search for from the entry list.
b) Select the category id according to the entry id from the entry-class list.
c) Select the category name according to the category id from the category list.
3) Find the names of the related entries, that is, the entries that share a category with the searched entry, as follows.
a) Select the entry id whose entry name is the same as what the user wants to search for from the entry list.
b) Select the category ids according to the entry id from the entry-class list.
c) Select the other entry ids whose category id is the same as that of the original entry id from the entry-class list.
d) Select the entry names according to the entry ids from the entry list.
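The three selection procedures above can be sketched with SQL queries run through Python's sqlite3 module. The table and column names and the sample rows are placeholders chosen for illustration, and the step-by-step selections are collapsed into joins, which is one possible way to write them rather than necessarily how the project's queries were written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables and sample rows, for demonstration only.
conn.executescript("""
CREATE TABLE entry (entry_id INTEGER PRIMARY KEY, name TEXT, content TEXT);
CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE entry_class (entry_id INTEGER, category_id INTEGER);
INSERT INTO entry VALUES (1, 'Kanban', 'Placeholder text A.');
INSERT INTO entry VALUES (2, 'Kaizen', 'Placeholder text B.');
INSERT INTO entry VALUES (3, 'Ergonomics', 'Placeholder text C.');
INSERT INTO category VALUES (10, 'Lean manufacturing');
INSERT INTO category VALUES (20, 'Human factors');
INSERT INTO entry_class VALUES (1, 10);
INSERT INTO entry_class VALUES (2, 10);
INSERT INTO entry_class VALUES (3, 20);
""")


def entry_content(name):
    """Procedure 1: entry name -> entry content."""
    row = conn.execute(
        "SELECT content FROM entry WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None


def categories_of(name):
    """Procedure 2: entry name -> entry id -> category ids -> category names."""
    rows = conn.execute("""
        SELECT c.name
        FROM entry e
        JOIN entry_class ec ON ec.entry_id = e.entry_id
        JOIN category c ON c.category_id = ec.category_id
        WHERE e.name = ?
        ORDER BY c.name""", (name,))
    return [row[0] for row in rows]


def related_entries(name):
    """Procedure 3: entry name -> its categories -> other entries in them."""
    rows = conn.execute("""
        SELECT DISTINCT other.name
        FROM entry e
        JOIN entry_class ec ON ec.entry_id = e.entry_id
        JOIN entry_class ec2 ON ec2.category_id = ec.category_id
        JOIN entry other ON other.entry_id = ec2.entry_id
        WHERE e.name = ? AND other.entry_id <> e.entry_id
        ORDER BY other.name""", (name,))
    return [row[0] for row in rows]
```

With the sample rows, related_entries("Kanban") yields only "Kaizen", because those two entries share the category "Lean manufacturing" while "Ergonomics" does not.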
5.4 Rank
5.4.1 Rank the entries
The number of related entries is very large, so it may take a long time to find the one the user is looking for. We therefore rank the entries of a category by the number of times they have been searched, as follows.
a) Add an attribute “rank” to the entry list.
b) Increase an entry’s rank by 1 every time the result page for that entry runs.
c) Output the entries ordered by their ranks.
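The ranking steps above can be sketched as follows, again with sqlite3. The table layout and sample names are assumptions, and breaking ties alphabetically is an added choice that the steps do not specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Step a): the entry list carries a "rank" attribute, starting at 0.
conn.execute(
    'CREATE TABLE entry ('
    'entry_id INTEGER PRIMARY KEY, name TEXT, "rank" INTEGER NOT NULL DEFAULT 0)')
conn.executemany(
    "INSERT INTO entry (name) VALUES (?)",
    [("Kanban",), ("Kaizen",), ("Six Sigma",)])  # hypothetical entries


def record_view(name):
    """Step b): add 1 to the entry's rank each time its result page is shown."""
    conn.execute('UPDATE entry SET "rank" = "rank" + 1 WHERE name = ?', (name,))


def ranked_entries():
    """Step c): output entries ordered by rank, highest first."""
    rows = conn.execute('SELECT name FROM entry ORDER BY "rank" DESC, name ASC')
    return [row[0] for row in rows]


# Simulate three result-page views.
for viewed in ["Kaizen", "Six Sigma", "Kaizen"]:
    record_view(viewed)
ordering = ranked_entries()
```

After these views, "Kaizen" (two views) is listed before "Six Sigma" (one view) and "Kanban" (none).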
5.4.2 Advantages
a) Show the searching results more clearly.
b) Help the user to find the required result more efficiently.
6. Visualization
6.1 Visualization Tools
Knowledge Garden provides users with two kinds of visualization tools: Knowledge Tree and Entry Cloud. The former shows the relations between entries and categories in the form of a graph, and the latter dynamically shows the entries related to a specific one. Both help users learn about Industrial Engineering more effectively with Knowledge Garden.
6.1.1 Knowledge Tree
Knowledge Tree is a visualization tool we developed using the Google Visualization API. It shows the relationship between entries and categories as a tree graph.
The number of trees shown together is determined by the number of categories that the searched entry belongs to, and the children of the top node (category) are the entries belonging to this category.
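One way to prepare the data for such trees is to turn each category and its entries into (node id, label, parent id) rows, which is the general shape that hierarchical chart types expect. The sketch below is an assumption about the data preparation only: the report does not say which chart type was used or how rows were built, and the function name, the row layout, and the max_children limit are all illustrative.

```python
def tree_rows(category_to_entries, max_children=None):
    """Build (node_id, label, parent_id) rows, one tree per category.

    Each category becomes a root (empty parent id) and its entries become
    children. Node ids are prefixed with the category name so that an entry
    belonging to several categories appears once in each tree.
    """
    rows = []
    for category, entries in category_to_entries.items():
        rows.append((category, category, ""))
        shown = entries if max_children is None else entries[:max_children]
        for entry in shown:
            rows.append((f"{category}/{entry}", entry, category))
    return rows


# Hypothetical categories of a searched entry "Kaizen".
rows = tree_rows(
    {"Lean manufacturing": ["Kanban", "Kaizen", "Muda"],
     "Quality": ["Kaizen"]},
    max_children=2,
)
```

With two categories the result describes two trees, and the max_children argument caps how many entry nodes hang under each category.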
6.1.2 Entry Cloud
Entry Cloud is a visualization tool developed using the SWFObject script (the SWF embed script formerly known as FlashObject). It dynamically shows the entries related to the searched entry as tags on a rotating ball. If the user clicks on one of the tags, the web page jumps to Google and searches for that entry.
6.2 User Preference
In Knowledge Garden, we provide a function that lets users set their preferred number of entry nodes in the Knowledge Tree and number of entries in the Entry Cloud.
When a preference is set, a new TXT file recording the parameters is created on the user’s personal computer. When the user opens the Knowledge Garden web page again, the parameters are read automatically.
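The save-and-reload behaviour can be sketched as below. The file format (one key=value pair per line), the key names, and the default values are all assumptions made for illustration; the report only states that the parameters are recorded in a TXT file.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical parameter names and default values.
DEFAULTS = {"tree_nodes": 10, "cloud_entries": 20}


def save_preferences(path, prefs):
    """Write the preferences as key=value lines in a plain text file."""
    lines = [f"{key}={prefs[key]}" for key in sorted(prefs)]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def load_preferences(path):
    """Read preferences, falling back to defaults for missing or invalid values."""
    prefs = dict(DEFAULTS)
    file = Path(path)
    if file.exists():
        for line in file.read_text(encoding="utf-8").splitlines():
            key, sep, value = line.partition("=")
            key, value = key.strip(), value.strip()
            if sep and key in DEFAULTS and value.isdigit():
                prefs[key] = int(value)
    return prefs


# Demonstration in a temporary folder: first visit, save, then revisit.
with tempfile.TemporaryDirectory() as folder:
    prefs_file = os.path.join(folder, "knowledge_garden_prefs.txt")
    first_visit = load_preferences(prefs_file)
    save_preferences(prefs_file, {"tree_nodes": 5, "cloud_entries": 30})
    next_visit = load_preferences(prefs_file)
```

On the first visit no file exists, so the defaults apply; after saving, the next visit picks up the stored values automatically.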
6.3 Extra Function
In Knowledge Garden, we also provide some extra functions to help users. They are shown below:
a) Help Document: a web page that teaches users how to use Knowledge Garden.
b) About Us: links to http://toyhouse.cc/memo
c) Set as Favorite
d) Set as Homepage
7. Project Web Page
8. Experience and Lessons Learned
Through the course, our team learned a lot:
8.1 Conceptual
a) We live in a time of information explosion, so finding the right way to store and utilize information in a database is key to achievement and an organized life.
b) The relationships between data are themselves a kind of data. The more relations there are, the more useful the data is.
c) Current search engines are based on keywords, but the server does not understand the meaning of the words. If we could make the words “semantic”, it would be great progress toward providing more useful and better-targeted information for users.
8.2 Technical
a) RDF uses triples to store the relationships between subjects and objects.
b) The URL is the basis on which a network database locates information.
c) We learned systematically about network databases, especially DBpedia.
d) We found that DBpedia serves as the structured database of Wikipedia. It consists of millions of RDF triples, whose subjects and objects are all URLs. All the kinds of predicates are collected together in one package.
e) The structure of DBpedia is made up of two kinds of elements: Category and Entry. Entries belong to Categories, and the Categories themselves are organized in a tree structure.
f) To realize the functions, we learned to design an E-R diagram. Following the E-R model, we built our database and loaded it with the data grabbed from DBpedia.
g) We learned SQL to query the database.
h) We used Google Visualization and other tools to implement the visualization.
i) We used PHP and HTML to build an attractive web page.
8.3 Spiritual
a) Good time management is more important than the individual ability.
b) You cannot ask someone to speak about what he or she does not understand.
c) No pains, no gains.

















