Rpapo's Translation Assistant.

Post by **hobogunner** » Tue Nov 22, 2011 8:21 pm

So as to stop other threads from being taken up with our ramblings, I figured this should be created, so, continue discussion at...

Jonathanasdf's post from the Golden Time one.

Rpapo's main entry on it:

Come and get it. Download http://mywebpages.comcast.net/rpapo/Nihongo.zip.

The program to be run is at x64\release\nihongo.exe, and it requires two main input parameters: the name of the file to be read in (it must be Unicode TXT format), and the name of the file to be generated as output. Additionally, you can specify the line number (zero-based) where you want to start parsing, and the line number where you want to stop parsing. A complete command line might look like this:
Code:
c:\Nihongo> x64\release\nihongo.exe GoldenTime_1.txt OutputText.txt 1623 1631
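For anyone wondering how the four parameters fit together, here is a minimal sketch of the argument handling. It is not taken from the Nihongo source: `Options`, `parseLineNumber` and `parseArgs` are hypothetical names, and treating a missing stop line as "run to the end of the file" is an assumption, since the post does not say what the defaults are or whether the stop line is inclusive.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical holder for the parameters described in the post.
struct Options {
    std::string inputPath;   // Unicode TXT file to read
    std::string outputPath;  // file to generate
    long firstLine = 0;      // zero-based line where parsing starts
    long lastLine = -1;      // zero-based line where parsing stops; -1 = assumed "to end of file"
};

// Parse a non-negative decimal line number; returns false on anything else.
static bool parseLineNumber(const std::string& text, long& out) {
    if (text.empty()) return false;
    char* end = nullptr;
    const long value = std::strtol(text.c_str(), &end, 10);
    if (*end != '\0' || value < 0) return false;
    out = value;
    return true;
}

// args holds the command-line parameters without the program name.
// Returns false when the parameters do not match: input output [first [last]].
bool parseArgs(const std::vector<std::string>& args, Options& options) {
    if (args.size() < 2 || args.size() > 4) return false;
    Options parsed;
    parsed.inputPath = args[0];
    parsed.outputPath = args[1];
    if (args.size() >= 3 && !parseLineNumber(args[2], parsed.firstLine)) return false;
    if (args.size() == 4) {
        if (!parseLineNumber(args[3], parsed.lastLine)) return false;
        if (parsed.lastLine < parsed.firstLine) return false;  // empty range
    }
    options = parsed;
    return true;
}
```

From `main`, this would be called as `parseArgs(std::vector<std::string>(argv + 1, argv + argc), options)`.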

Post by **rpapo** » Wed Nov 23, 2011 3:37 am

Some additional information about the program. It was written to help me work on these translations, automating what was within my abilities. I am not a professor of linguistics with a specialty in computer science, but I have been writing code professionally for more than thirty years now.

The program is NOT a translator. It is a parser. That is, it attempts to break Japanese text into its constituent words, and then once that is done it does exhaustive dictionary lookups on the resultant words, giving a translator much of the raw material he needs to piece together the meaning of the text. It does not attempt to provide the resultant translation, though it provides much of what is needed as input for a translator program, should someone attempt to write one. That task would require a considerable background in theoretical linguistics, I think.
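To make "break the text into words, then look them up" concrete, here is a minimal sketch using greedy longest-match segmentation. This is a textbook technique swapped in for illustration and may well differ from what the program actually does; the `segment` function, its `maxWordLength` parameter and the use of a plain `std::set` as the dictionary are all assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Split text into words by repeatedly taking the longest dictionary entry
// that matches at the current position. A character that starts no known
// word is emitted on its own, so the scan always makes progress.
std::vector<std::wstring> segment(const std::wstring& text,
                                  const std::set<std::wstring>& dictionary,
                                  std::size_t maxWordLength = 8) {
    std::vector<std::wstring> words;
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Never look past the end of the text.
        std::size_t len = std::min(maxWordLength, text.size() - pos);
        // Shrink the candidate until it is a dictionary word or one character.
        while (len > 1 && dictionary.count(text.substr(pos, len)) == 0) {
            --len;
        }
        words.push_back(text.substr(pos, len));
        pos += len;
    }
    return words;
}
```

Each returned word can then be looked up for its definitions. Greedy matching picks the wrong split whenever a shorter word followed by another word was intended, which is one reason a real parser needs grammar knowledge on top of this.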

To use what I have built, in its current form, you need to download the file given in the previous posting and unzip it to a directory/folder all its own. On my system, this is c:\Projects\Nihongo, though you may wish to just use c:\Nihongo. If you have Microsoft Visual Studio 2008 installed on your system, then you can load the solution file, Nihongo.sln, and rebuild the whole thing for yourself. The startup project should be set to "Nihongo". There are several other semi-independent projects in the solution, "Analyzer" and "Parser2", as well as several subcomponents, "JDICT", "JIS0208" and "Juman". More on that stuff later, as it is only relevant to people actually trying to play with the code and modify it.

A copy of the main dictionary file used by the program, EDICT, can be found in the JDICT folder. This file may be updated at any time by downloading a new dictionary from the WWWJDIC web site at http://www.csse.monash.edu.au/~jwb/edict.html. Since this dictionary is constantly improving, updating it every now and then is a good idea. Nothing like getting free improvements!
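For those curious what is inside that file: a classic EDICT line has the shape `headword [reading] /gloss/gloss/`, with the bracketed reading left out for kana-only words. Below is a minimal sketch of splitting one such line, assuming it has already been converted to wide characters. `EdictEntry`, `trimRight` and `parseEdictLine` are hypothetical names and this is not the program's own reader.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one dictionary line.
struct EdictEntry {
    std::wstring headword;              // kanji form, or kana when there is no kanji
    std::wstring reading;               // kana reading; empty when the line has none
    std::vector<std::wstring> glosses;  // slash-separated fields, kept verbatim
};

// Remove trailing spaces and line-ending characters in place.
static void trimRight(std::wstring& text) {
    while (!text.empty() &&
           (text.back() == L' ' || text.back() == L'\r' || text.back() == L'\n')) {
        text.pop_back();
    }
}

// Parse "headword [reading] /gloss/gloss/". Returns false for lines that
// do not have a headword and at least one gloss.
bool parseEdictLine(std::wstring line, EdictEntry& entry) {
    trimRight(line);
    const std::size_t firstSlash = line.find(L'/');
    if (firstSlash == std::wstring::npos) return false;

    EdictEntry parsed;
    std::wstring head = line.substr(0, firstSlash);
    const std::size_t open = head.find(L'[');
    const std::size_t close = head.find(L']');
    if (open != std::wstring::npos && close != std::wstring::npos && open < close) {
        parsed.reading = head.substr(open + 1, close - open - 1);
        head.erase(open);  // keep only the part before the bracket
    }
    trimRight(head);
    if (head.empty()) return false;
    parsed.headword = head;

    // Everything after the first slash is a list of slash-terminated fields.
    std::size_t start = firstSlash + 1;
    while (start < line.size()) {
        std::size_t end = line.find(L'/', start);
        if (end == std::wstring::npos) end = line.size();
        if (end > start) parsed.glosses.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    if (parsed.glosses.empty()) return false;
    entry = parsed;
    return true;
}
```

Part-of-speech tags such as `(v1,vt)` and any trailing marker fields stay inside the gloss strings here; separating them out is left to the caller.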

There are a number of things that need to be done with the program yet, including:

(1) Externalize the personal additions to the dictionary, placing them into a Unicode TXT file that you don't have to be a programmer to update.
(2) Create a graphical user interface with (as a minimum) controls for specifying the input file, the output file and which line(s) to parse. Later improvements could include providing a user interface for the maintenance of the personal dictionary extensions.
(3) Improve the parser to handle yet more of the possible verb/adjective conjugations. To do this without requiring a system with 32 GB (or more!) of RAM, some while ago I started a parallel project which determines conjugations on the fly. This is the "Parser2" project. It is far from complete, and not ready to use.
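As an illustration of what "determining conjugations on the fly" in item (3) could look like, here is a minimal sketch that strips a known inflected ending, puts the dictionary ending back, and keeps the result only if the dictionary contains it. It is not the Parser2 design, which the post does not describe. The `Rule` table, the `Candidate` struct and `deconjugate` are hypothetical, and the five rules cover only する and a few ichidan endings.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One rewrite rule: an inflected ending and the dictionary ending it maps to.
struct Rule {
    const wchar_t* ending;
    const wchar_t* dictionaryEnding;
    const char* label;
};

// Tiny illustrative rule set; a real table would be far larger.
static const Rule kRules[] = {
    { L"\u3057\u305F\u3089", L"\u3059\u308B", "plain conditional of suru" },  // したら -> する
    { L"\u305F\u3089",       L"\u308B",       "plain conditional (ichidan)" }, // たら -> る
    { L"\u306A\u3044",       L"\u308B",       "plain negative (ichidan)" },    // ない -> る
    { L"\u307E\u3059",       L"\u308B",       "polite non-past (ichidan)" },   // ます -> る
    { L"\u305F",             L"\u308B",       "plain past (ichidan)" },        // た -> る
};

struct Candidate {
    std::wstring dictionaryForm;
    std::string label;
};

// Return every dictionary word that some rule can turn into `word`.
std::vector<Candidate> deconjugate(const std::wstring& word,
                                   const std::set<std::wstring>& dictionary) {
    std::vector<Candidate> found;
    for (const Rule& rule : kRules) {
        const std::wstring ending(rule.ending);
        if (word.size() < ending.size()) continue;
        const std::size_t stemLength = word.size() - ending.size();
        if (word.compare(stemLength, ending.size(), ending) != 0) continue;
        // Rebuild the dictionary form and keep it only if it is a real word.
        const std::wstring candidate = word.substr(0, stemLength) + rule.dictionaryEnding;
        if (dictionary.count(candidate) != 0) {
            found.push_back({candidate, rule.label});
        }
    }
    return found;
}
```

The point of the approach is that only dictionary forms are stored, and inflected forms are recognized at lookup time instead of being generated and held in memory up front.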

I have been trying to restrain myself from working too much on the program, mainly because I need/want to improve my personal grasp of Japanese. Computer programming for me is fun, and all too tempting when there are other things I ought to be doing with my time. That said, though, writing this program hugely improved my knowledge of how the language works grammatically, and especially in the area of verb and adjective conjugations.

Post by **rpapo** » Wed Nov 23, 2011 3:40 am

Some programmers might ask why I have done this on Windows, and with Microsoft Visual Studio 2008. The answer is simple: my paycheck comes from working with both tools, and they are essentially free for me. Why don't I use Visual Studio 2010, then? Again simple: because I think the thing is a pig. I hope Micro$oft gets their act back in gear for Visual Studio 2012.

Post by **jonathanasdf** » Wed Nov 23, 2011 7:41 am

I was trying to translate it to gnu c++ but kinda gave up after finishing Nihongo.cpp and then realizing there's 3 other files that are even longer...

The reason was that I was trying to add an additional user_definitions.txt file that would be parsed for "word":"defn" pairs and automatically AddWord them, but I absolutely refuse to work with FILE and wchar_t* arrays, so somehow I ended up translating it... (fstream and wstring ftw!) Additionally, I thought that if it could be compiled by g++ it'd be a lot more lightweight, and people on Macs and *nix could use it too. People on Windows could as well; they'd have to install Cygwin or something like that, but it'd still be a lot more lightweight than having to install VS Express or something.

But meh I don't think I'll be touching this source again..... good luck on it, I'll probably still use it so keep up the good work.

Post by **rpapo** » Wed Nov 23, 2011 7:52 am

jonathanasdf wrote:I was trying to translate it to gnu c++ but kinda gave up after finishing Nihongo.cpp and then realizing there's 3 other files that are even longer...
...
But meh I don't think I'll be touching this source again..... good luck on it, I'll probably still use it so keep up the good work.

Worthy idea, but the GNU and M$ worlds are far apart, unfortunately. M$FT likes it that way . . .

In any case, I coded it with wchar_t because it needed to be Unicode. FILE is something I'm used to working with. I use STL for some things, but have found that it can also be quite a pig. It's not in this project, but I've built my own List and Map template classes to get around some of the excess overhead.

Could be worse: I could be using MFC and CString . . .

Post by **rpapo** » Wed Nov 23, 2011 8:40 am

jonathanasdf wrote:...lightweight...

Actually, most of the expense of using this program is in how much memory is consumed when building the dictionary, which is a collection of STL maps, sets and strings. That part isn't going to get significantly lighter by changing compilers. What will make it far lighter will be the Parser2 project . . . once I get back to working on it.

I use MSVC simply because it works and has no incremental cost for me. I already use the stuff because I use it on the job. They pay me well to do so, so I keep myself current so my pay continues to stay current too . . . There's something about having a wife and kids, and all the associated expenses, that makes you want to make sure you keep on getting paid.

That said, I also believe in limiting my paid work to 40-50 hours a week. After that, I get to work on whatever I feel like working on. For the time being, that is a mix of home maintenance, church service and Japanese. And of course, eating and sleeping. As a human being I'm kind of addicted to those.

Post by **jonathanasdf** » Wed Nov 23, 2011 2:45 pm

wstring is basic_string<wchar_t> so it deals with internationalization fine too. I wasn't complaining so much about wchar_t but wchar_t* AKA a c-string, instead of a c++ string.

By lightweight I mean not having to open up VS to compile, and thus easier for other people to make additional changes to it. It takes like 5 minutes for VS to open on my laptop so if I could avoid opening it it would be a lot better... But then there are people like you who prefer working with VS so meh this could be good or bad depending on the person heh.

Post by **rpapo** » Wed Nov 23, 2011 2:58 pm

jonathanasdf wrote:wstring is basic_string<wchar_t> so it deals with internationalization fine too. I wasn't complaining so much about wchar_t but wchar_t* AKA a c-string, instead of a c++ string.

Well, from where I sit, STL strings are expensive in terms of CPU, mainly because of the heap management. My background starts in the days where we didn't even have C, and used Assembler because there was nothing better. So my notion of fast and efficient is very different from your average Joe.

Not that I've spent much time at all tuning this particular program for speed . . .

jonathanasdf wrote:By lightweight I mean not having to open up VS to compile, and thus easier for other people to make additional changes to it. It takes like 5 minutes for VS to open on my laptop so if I could avoid opening it it would be a lot better... But then there are people like you who prefer working with VS so meh this could be good or bad depending on the person heh.

Ouch. VS comes up very quickly for me, but I'm running Windows 7 x64 on a quad-core 2.80 GHz AMD with 6 GB of RAM. My work machine is even better. Now VS2010 is slow in starting up, which is why I use VS2008 instead.

Post by **jonathanasdf** » Wed Nov 23, 2011 3:10 pm

Yep, you should use whatever you're most used to. Since I learned programming through languages like java and C#, I would rather throw my computer out of the window than work with c style strings.

And yeah, I'm on a 5-year-old laptop so...

Post by **rpapo** » Thu Nov 24, 2011 12:35 pm

I've just posted an update to the program. There was a bug in the conjugation of する. I noticed it when the parser didn't recognize したら for what it was: the plain positive conditional conjugation of "to do".

Post by **Mystes** » Thu Nov 24, 2011 2:06 pm

Hey, rpapo, where did you find the dictionaries for that?

Post by **Darklor** » Thu Nov 24, 2011 2:14 pm

Didn't he speak of WWWJDIC? Maybe you should read the second post again?

Post by **rpapo** » Thu Nov 24, 2011 2:16 pm

Kira0802 wrote:Hey, rpapo, where did you find the dictionaries for that?

Read the second message in this topic. In any case, the ZIP file with my program and its source code includes a copy of the dictionary file from WWWJDIC.

EDIT: The dictionary file is in a really strange encoding (ask Jim Breen about that). Part of my code is what it takes to read the crazy file, converting it to Unicode along the way.

Post by **Mystes** » Thu Nov 24, 2011 3:33 pm

Thanks...do you know if it'd work if I put a Chinese dictionary (CEDICTX) instead of the Japanese one?

Post by **rpapo** » Thu Nov 24, 2011 3:38 pm

Kira0802 wrote:Thanks...do you know if it'd work if I put a Chinese dictionary (CEDICTX) instead of the Japanese one?

Almost guarantee you it won't. The program was written specifically with the oddities of Japanese in mind, and I haven't the foggiest notion of what other monsters await in the shadows with Chinese.

ばか！バカ！　馬鹿ー月！
