Wikidata and Wikipedia data extraction for Scribe applications

abhijeet78880/Scribe-Data

 
 



Scribe-Data contains the scripts for extracting and formatting data from Wikidata and Wikipedia for Scribe applications. Updates to the language keyboard and interface data can be done using scribe_data/load/update_data.py and the notebooks within the scribe_data/load directory.

Note: The contributing section has information for those interested in helping out. The articles and presentations listed under featured by are also good resources for learning more about Scribe.

Scribe applications are available on iOS, Android (WIP) and Desktop (planned).


Process

scribe_data/load/update_data.py and the notebooks within the scribe_data/load directory are used to update all data for Scribe-iOS, with this functionality later being expanded to update Scribe-Android and Scribe-Desktop when they're active.

The main data update process in update_data.py triggers SPARQL queries that retrieve language data from Wikidata, using SPARQLWrapper to access the query service. The autosuggestion process derives popular words from Wikipedia, as well as the words that normally follow them, to provide an effective baseline feature until natural language processing techniques are employed. Functions to generate autosuggestions are run in gen_autosuggestions.ipynb. Emojis are further sourced from Unicode CLDR, with this process being run in gen_emoji_lexicon.ipynb.
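The autosuggestion idea described above (popular words plus the words that most often follow them) can be illustrated with a minimal sketch. This is not the code in gen_autosuggestions.ipynb; the function names `tokenize` and `sketch_autosuggestions`, their parameters, and the tiny in-memory corpus are assumptions made purely for illustration, and the real process works on Wikipedia text.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphabetic words (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def sketch_autosuggestions(texts, num_words=3, num_suggestions=2):
    """Map the most frequent words to the words that most often follow them.

    Hypothetical helper for illustration only, not part of Scribe-Data.
    """
    word_counts = Counter()
    followers = defaultdict(Counter)

    for text in texts:
        tokens = tokenize(text)
        word_counts.update(tokens)
        # Count each adjacent (current word, next word) pair.
        for current, following in zip(tokens, tokens[1:]):
            followers[current][following] += 1

    return {
        word: [w for w, _ in followers[word].most_common(num_suggestions)]
        for word, _ in word_counts.most_common(num_words)
    }


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the cat ate the fish",
        "a dog sat on the rug",
    ]
    print(sketch_autosuggestions(corpus, num_words=1, num_suggestions=1))
```

Counting adjacent word pairs keeps the baseline simple and language-agnostic, which matches the stated intent of using it until natural language processing techniques are employed.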

The ultimate goal is that this repository will house language packs that are periodically updated with new Wikidata lexicographical data and data from other sources. These packs would then be available to download by users of Scribe applications.

Contributing

Public Matrix Chat

Scribe uses Matrix for communications. You're more than welcome to join us in our public chat rooms to share ideas, ask questions or just say hi :)

Please see the contribution guidelines if you are interested in contributing to Scribe-Data. Work that is in progress or could be implemented is tracked in the issues and projects. Also check the -priority- labels in the issues for those that are most important, as well as the issues marked good first issue, which are tailored for first-time contributors.

After your first few pull requests, organization members would be happy to discuss granting you further rights as a contributor, with a maintainer role then being possible after continued interest in the project. Scribe seeks to be an inclusive and supportive organization. We'd love to have you on the team!

Ways to Help

Road Map

The Scribe road map can be followed in the organization's project board, where we list the most important issues along with their priority, status and an indication of which sub-projects they're included in (if applicable).

Data Edits

Scribe does not accept direct edits to the grammar JSON files as they are sourced from Wikidata. Edits can be discussed, and the queries themselves will be changed and run before an update. If there is a problem with one of the files, then the fix should be made on Wikidata and not on Scribe. Feel free to let us know that edits have been made by opening a data issue and we'll be happy to integrate them!

Supported Languages

Scribe's goal is functional, feature-rich keyboards and interfaces for all languages. Check the extract_transform directory for queries for currently supported languages and those that have substantial data on Wikidata.
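As a rough idea of what a per-language query against Wikidata might look like, here is a minimal sketch using SPARQLWrapper. It is not taken from the extract_transform directory: the function names `build_noun_query` and `fetch_nouns`, the user agent string, and the query shape are assumptions for illustration. Q188 (German) and Q1084 (noun) are Wikidata item IDs.

```python
# Hypothetical sketch; the actual Scribe-Data queries may differ.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_noun_query(language_qid, limit=10):
    """Return a SPARQL query selecting noun lexemes and their lemmas."""
    return f"""
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wikibase: <http://wikiba.se/ontology#>
    PREFIX dct: <http://purl.org/dc/terms/>

    SELECT ?lexeme ?lemma WHERE {{
      ?lexeme dct:language wd:{language_qid} ;
              wikibase:lexicalCategory wd:Q1084 ;
              wikibase:lemma ?lemma .
    }}
    LIMIT {limit}
    """


def fetch_nouns(language_qid, limit=10):
    """Run the query against the public endpoint (requires network access)."""
    # Imported lazily so the query builder works without the dependency.
    from SPARQLWrapper import JSON, SPARQLWrapper

    sparql = SPARQLWrapper(WIKIDATA_ENDPOINT, agent="scribe-data-example/0.1")
    sparql.setQuery(build_noun_query(language_qid, limit))
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [row["lemma"]["value"] for row in results["results"]["bindings"]]


if __name__ == "__main__":
    # Q188 is the Wikidata item for German.
    print(build_noun_query("Q188", limit=5))
```

Calling `fetch_nouns("Q188")` would return a list of German noun lemmas, assuming SPARQLWrapper is installed and the endpoint is reachable.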

The following table shows the supported languages and the amount of data available for each on Wikidata and via Unicode CLDR for emojis:

| Languages | Nouns | Verbs | Translations* | Prepositions† | Emoji Keywords |
| --- | --- | --- | --- | --- | --- |
| French | 16,815 | 5,450 | 67,652 | - | 2,493 |
| German | 29,272 | 3,557 | 67,652 | 187 | 2,901 |
| Italian | 8,646 | 73 | 67,652 | - | 2,463 |
| Portuguese | 5,191 | 495 | 67,652 | - | 2,336 |
| Russian | 194,419 | 11 | 67,652 | 13 | 3,834 |
| Spanish | 27,128 | 4,036 | 67,652 | - | 3,144 |
| Swedish | 42,807 | 4,394 | 67,652 | - | 2,916 |

* Given the current beta status where words are machine translated.

† Only for languages for which preposition annotation is needed.

Featured By

Articles and Presentations on Scribe


Powered By

Contributors

Many thanks to all the Scribe-Data contributors! 🚀

Blog posts

List of referenced posts

Wikimedia Communities


