Extract XML from the OS X dictionaries.

Overview

Before You Start

Apple-peeler was written using python 3.9 (but it should be trivial to support earlier versions of python 3.5+).

Installation

pip install apple-peeler

Dependencies

BeautifulSoup 4, lxml, and click

Usage

Apple likes to move around the dictionaries location from macOS version to macOS version. So if the dictionaries are no longer at the path below you can tell apple-peeler where to look by exporting DICT_BASE in your environment or using the --base option below.

export DICT_BASE="/System/Library/AssetsV2/com_apple_MobileAsset_DictionaryServices_dictionaryOSX/"

After that, useage is straightforward.

Usage: apple-peeler [OPTIONS]

Extract XML from Apple Dictionary files.

Options:
--base DIRECTORY                The root directory of the OS X dictionaries.
                                (Default: /System/Library/AssetsV2/com_apple
                                _MobileAsset_DictionaryServices_dictionaryOS
                                X/) [Env var DICT_BASE]
--out DIRECTORY                 The path to place extracted XML files.
-d, --dictionary [
    all|Arabic - English|Danish|Duden Dictionary Data Set I|Dutch|
    Dutch - English|French|French - English|French - German|German - English|
    Hebrew|Hindi|Hindi - English|Indonesian - English|Italian|
    Italian - English|Korean|Korean - English|New Oxford American Dictionary|
    Norwegian|Oxford American Writer's Thesaurus|
    Oxford Dictionary of English|Oxford Thesaurus of English|
    Polish - English|Portuguese|Portuguese - English|Russian|
    Russian - English|Sanseido Super Daijirin|
    Sanseido The WISDOM English-Japanese Japanese-English Dictionary|
    Simplified Chinese - English|Simplified Chinese - Japanese|Spanish|
    Spanish - English|Swedish|Thai|Thai - English|
    The Standard Dictionary of Contemporary Chinese|Traditional Chinese|
    Traditional Chinese - English|Turkish|Vietnamese - English]
                                The dictionary to extract or 'all'.
                                (Default: all) [Accepts multiple]
--format-xml / --no-format-xml  Format the XML files using BeautifulSoup.
                                (Default: False)
--debug                         Output debug information to STDERR.
                                (Default: False)
--help                          Show this message and exit.

Introduction

I need a ton of dictionary data for prototyping my learning a language tool, Parsnip, and licensing 40 dictionaries seems too expensive for a bootstrapper prototyping / working on an MVP (I look forward to the day this is no longer true). [Note: I am not planning to redistribute or otherwise use the data in an unlicensed manner.]

Parsnip uses Natural Language Processing and Dictionaries to decouple the word <-> sentence tug-of-war that's existed as long as flashcards have been used for language learning. I.e., should I make a word (concept) or a sentence (example) flashcard?

I care about what words I know for tracking purposes, but I want those words in context when I'm practicing. So the learning system breaks down sentences into lemmas (or dictionary form of a word) and a database of example sentences that the words appear in. This resolves the conceptual tug-of-war for flashcards.

But by removing reference data from the flashcards themselves, I need to integrate reference material directly into Parsnip's UI. JMDict is a great open source project for this, but that only covers a single language. So, I've been keeping my eyes open for people working on extracting the data from Apple's bundled dictionaries.

This has been a community effort that's spanned several years. My contribution is to collect the results, clear up some details about the file format, and package it into a general command-line tool.

References

This is inspired by Reverse-Engineering Apple Dictionary. And the discussion on Hacker News Hacker News: Reverse-Engineering Apple Dictionary (2020). Special thanks to tim-- and enragedcacti who introduced me to binwalk. And dunham who mentioned the random bytes looking like ints of payload sizes.

Additionally, I've found these posts informative:

You might also like...
extract gene TSS/TES site form gencode/ensembl/gencode database GTF file and export bed format file.

GetTsite python Package extract gene TSS/TES site form gencode/ensembl/gencode database GTF file and export bed format file. Install $ pip install Get

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

Parsel Parsel is a BSD-licensed Python library to extract and remove data from HTML and XML using XPath and CSS selectors, optionally combined with re

Finally decent dictionaries based on Wiktionary for your beloved eBook reader.
Finally decent dictionaries based on Wiktionary for your beloved eBook reader.

eBook Reader Dictionaries Finally, decent dictionaries based on Wiktionary for your beloved eBook reader. Dictionaries Catalan 🚧 Ελληνικά (help welco

Generates password lists/dictionaries based on keywords written in python3.

dicbyru Introduction Generates password lists/dictionaries based on keywords. It uses the keywords and adds capital letters, numbers and special chara

This utility synchronises spelling dictionaries from various tools with each other.

This utility synchronises spelling dictionaries from various tools with each other. This way the words that have been trained on MS Office are also correctly checked in vim or Firefox. And vice versa of course.

Safely add untrusted strings to HTML/XML markup.

MarkupSafe MarkupSafe implements a text object that escapes characters so it is safe to use in HTML and XML. Characters that have special meanings are

Converts XML to Python objects

untangle Documentation Converts XML to a Python object. Siblings with similar names are grouped into a list. Children can be accessed with parent.chil

Python module that makes working with XML feel like you are working with JSON

xmltodict xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec": print(json.dumps(xmltod

Create Open XML PowerPoint documents in Python

python-pptx is a Python library for creating and updating PowerPoint (.pptx) files. A typical use would be generating a customized PowerPoint presenta

The lxml XML toolkit for Python

What is lxml? lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language. It's also very fast and memory

PubMed Mapper: A Python library that map PubMed XML to Python object

pubmed-mapper: A Python Library that map PubMed XML to Python object 中文文档 1. Philosophy view UML Programmatically access PubMed article is a common ta

Simple app for visual editing of Page XML files

Name nw-page-editor - Simple app for visual editing of Page XML files. Version: 2021.02.22 Description nw-page-editor is an application for viewing/ed

PAGE XML format collection for document image page content and more
PAGE XML format collection for document image page content and more

PAGE-XML PAGE XML format collection for document image page content and more For an introduction, please see the following publication: http://www.pri

A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file
A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file

Udemy Scraper A Web Scraper built with beautiful soup, that fetches udemy course information. Installation Virtual Environment Firstly, it is recommen

A repository that shares tuning results of trained models generated by TensorFlow / Keras. Post-training quantization (Weight Quantization, Integer Quantization, Full Integer Quantization, Float16 Quantization), Quantization-aware training. TensorFlow Lite. OpenVINO. CoreML. TensorFlow.js. TF-TRT. MediaPipe. ONNX. [.tflite,.h5,.pb,saved_model,tfjs,tftrt,mlmodel,.xml/.bin, .onnx]
A python script to convert an ucompressed Gnucash XML file to a text file for Ledger and hledger.

README 1 gnucash2ledger gnucash2ledger is a Python script based on the Github Gist by nonducor (nonducor/gcash2ledger.py). This Python script will tak

Json2Xml tool will help you convert from json COCO format to VOC xml format in Object Detection Problem.

JSON 2 XML All codes assume running from root directory. Please update the sys path at the beginning of the codes before running. Over View Json2Xml t

Txt2Xml tool will help you convert from txt COCO format to VOC xml format in Object Detection Problem.

TXT 2 XML All codes assume running from root directory. Please update the sys path at the beginning of the codes before running. Over View Txt2Xml too

Releases(v0.1.1)
Owner
Joshua Olson
Joshua Olson
.bvh to .mcfunction file converter.

bvh-to-mcf .bvh file to .mcfunction converter

Hanmin Kim 28 Nov 21, 2022
An URL checking python module

An URL checking python module

Fayas Noushad 6 Aug 10, 2022
✨ Un DNS Resolver totalement fait en Python par moi, et en français

DNS Resolver ❗ Un DNS Resolver totalement fait en Python par moi, et en français. 🔮 Grâce a une adresse (url) vous pourrez avoir l'ip ainsi que le DN

MrGabin 3 Jun 06, 2021
A fixture that allows runtime xfail

pytest-runtime-xfail pytest plugin, providing a runtime_xfail fixture, which is callable as runtime_xfail(), to allow runtime decisions to mark a test

Brian Okken 4 Apr 06, 2022
✨ Un bot Twitter totalement fait en Python par moi, et en français.

Twitter Bot ❗ Un bot Twitter totalement fait en Python par moi, et en français. Il faut remplacer auth = tweepy.OAuthHandler(consumer_key, consumer_se

MrGabin 3 Jun 06, 2021
This script allows you to retrieve all functions / variables names of a Python code, and the variables values.

Memory Extractor This script allows you to retrieve all functions / variables names of a Python code, and the variables values. How to use it ? The si

Venax 2 Dec 26, 2021
Convert any-bit number to decimal number and vise versa.

2deci Convert any-bit number to decimal number and vise versa. --bit n to set bit to n --exp xxx to set expression to xxx --r to run reversely (from d

3 Sep 15, 2021
Finds price floor for every single attribute in a given collection

Solana Solanart Scanner Enjoy the Free Code Steps to run Download VS Code

Dalton Nisbett 19 Oct 20, 2022
A Python utility belt containing simple tools, a stdlib like feel, and extra batteries. Hashing, Caching, Timing, Progress, and more made easy!

Ubelt is a small library of robust, tested, documented, and simple functions that extend the Python standard library. It has a flat API that all behav

Jon Crall 638 Dec 13, 2022
Easy compression and extraction for any compression or archival format.

Tzar: Tar, Zip, Anything Really Easy compression and extraction for any compression or archival format. Usage/Examples tzar compress large-dir compres

DanielVZ 37 Nov 02, 2022
A python module to manipulate XCode projects

This module can read, modify, and write a .pbxproj file from an Xcode 4+ projects. The file is usually called project.pbxproj and can be found inside the .xcodeproj bundle. Because some task cannot b

Ignacio Calderon 1.1k Jan 02, 2023
A fancy and practical functional tools

Funcy A collection of fancy functional tools focused on practicality. Inspired by clojure, underscore and my own abstractions. Keep reading to get an

Alexander Schepanovski 2.9k Jan 07, 2023
async parser for JET

This project is mainly aims to provide an async parsing option for NTDS.dit database file for obtaining user secrets.

15 Mar 08, 2022
python package for generating typescript grpc-web stubs from protobuf files.

grpc-web-proto-compile NOTE: This package has been superseded by romnn/proto-compile, which provides the same functionality but offers a lot more flex

Roman Dahm 0 Sep 05, 2021
A simple python implementation of Decision Tree.

DecisionTree A simple python implementation of Decision Tree, using Gini index. Usage: import DecisionTree node = DecisionTree.trainDecisionTree(lab

1 Nov 12, 2021
✨ Une calculatrice totalement faite en Python par moi, et en français.

Calculatrice ❗ Une calculatrice totalement faite en Python par moi, et en français. 🔮 Voici une calculatrice qui vous permet de faire vos additions,

MrGabin 3 Jun 06, 2021
Finger is a function symbol recognition engine for binary programs

Finger is a function symbol recognition engine for binary programs

332 Jan 01, 2023
Gradually automate your procedures, one step at a time

Gradualist Gradually automate your procedures, one step at a time Inspired by https://blog.danslimmon.com/2019/07/15/ Features Main Features Converts

Ross Jacobs 8 Jul 24, 2022
Small Python script to parse endlessh's output and print some neat statistics

endlessh_parser endlessh_parser is a small Python script that parses endlessh's output and prints some neat statistics about it Usage Install all the

ManicRobot 1 Oct 18, 2021
Python program to do with percentages and chances, random generation.

Chances and Percentages Python program to do with percentages and chances, random generation. What is this? This small program will generate a list wi

n0 3 Jul 15, 2021