In the Ozone forum, I have some results, still subject to debate and debugging, but working samples.
A practical issue I didn't want to mention at first: parsing the text and posters out of html documents is a pain in the behind.
In terms of execution speed I mean: as it stands, the script takes 1/3 of a second to parse a given set of keywords on a given page.
I am improving this with caching techniques, but the definitive improvement would be removing the "strip html" part.
Plus I have the posters/post count parsing at hand, so...
Is there a way I can discuss this out of the forums with one or more mods, including TP?
I'd love at least read access to the db, even if there is no ftp, so I can kill the html parsing code and make several tests without each test
taking 5 minutes.
It would help things progress *a lot* faster at this stage.
Email addies: mauro@mydomain.com or brundle [twenty one] AT hotmail.stuff
Tyberius Prime
Maniac (V) Mad Scientist with Finglongers
From: Germany Insane since: Sep 2001
posted 05-18-2006 23:13
The DB is copyright Dr O. - I can't just hand it out I'm afraid.
But what stops you from using your parser to create whatever format you need - and store that instead of the raw html files?
How do you propose keeping up with the changing nature of the Asylum, btw?
Obviously, I am not asking anyone to hand the db out.
Just direct access to the data layer.
Basically, I had assumed the data layer was a data layer, i.e. that post contents, posters, post labels and such
were data stored in fields of tables in a database.
I had assumed the sink was produced the same way.
quote:
But what stops you
I have to ask for some solid understanding on your side on this. Ready?
When I say "hectic schedule", "work a lot", "top consultant", I mean I happen to spend three or four days in a row
without a second of sleep due to the requirements of both my studies and job.
I haven't been out for four weeks now, for example: nothing private except Mother's day in four weeks.
Not a second aside from eating, working, sleeping.
So, answer a) to your question is: time.
Answer b) to your question is: productivity on a broader scale.
I can't afford producing "junk code" at all levels of a big project.
I can't afford tweaking an html parser to fit the Asylum just for myself.
I can't afford having the AI system take twice the time it would need each time I test run it because it has to parse the html
AND then process it.
Each time I add a keyword, I have to make several test passes for the keyword set.
If I edit ten keywords, I am in for ten test runs which will last something like 5 minutes each (since the app now uses all keyword sets on all pages).
Had I access to the data layer, this could be cut in two AT LEAST. Considering data analysis is way faster
than html parsing, we can safely assume the delay would be cut in three.
So the whole development is impacted by this factor: it will take me weeks this way to do what could be done in days.
There are loads of other "strictly technical reasons" I can quote.
One is obvious: it's related to your second question. (How do you propose...)
If I am developing for the applicative layer, then yes, I have to keep up with each and every layout / interface change.
If I am developing for the data layer, i.e. the DB, then things get simpler: an SQL layer will handle fetching the info I need,
which would separate data processing from layout constraints, and which would make changes as easy to implement as...
updating the SQL, not the php.
Anyone who knows which fields are fed to the AI could then update the SQL accordingly.
It's what these technologies are designed for, scalability, and it's the reason why *slow* databases are favored over *fast* flat files,
and why the Asylum mostly relies on a db.
Hell, it's the reason why you put the grail together in the first place: making data scalable.
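The data-layer argument above can be made concrete in a few lines. This is only a sketch: the schema here (a `threads` table, a `posts` table with `poster` and `body` fields) is entirely hypothetical, since the real Asylum schema isn't public; the point is that fetching a thread becomes one query instead of a strip-html pass over rendered markup.

```python
import sqlite3

# Hypothetical stand-in schema, purely to illustrate querying the data
# layer instead of scraping rendered HTML. Real table/field names unknown.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE threads (id INTEGER PRIMARY KEY, forum TEXT, title TEXT);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        thread_id INTEGER REFERENCES threads(id),
        poster TEXT,
        body TEXT
    );
""")
conn.execute("INSERT INTO threads VALUES (1, 'dhtml', 'Sorting tables in JS')")
conn.executemany(
    "INSERT INTO posts (thread_id, poster, body) VALUES (?, ?, ?)",
    [(1, "_Mauro", "Here is my sort routine..."),
     (1, "Tyberius Prime", "Try a stable sort instead.")],
)

def fetch_thread(thread_id):
    """Return (poster, body) pairs for one thread: no HTML parsing involved."""
    rows = conn.execute(
        "SELECT poster, body FROM posts WHERE thread_id = ? ORDER BY id",
        (thread_id,),
    )
    return rows.fetchall()

rows = fetch_thread(1)
```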
I didn't want to get into the details publicly, that's why I asked for a more direct discussion.
--------------------------------------------------------------
Please bear in mind that such a system, again, would be easy to adapt to different forums, different types of forums,
and different areas or tasks in a given forum.
Please take all precautions you think are worthy, have me sign documents, whatever, but take a clear stand:
do you want such a thing to happen? If you do, you have to trust me -a bit- and help me find a solution
that will not allow me to cause problems, and that will allow me to work productively.
'Nuff hacking scripts that will end in the trashcan on this one: I am not in for making an html parser, I am in for making an AI.
Plus that will make me feel that the idea is welcome: I sure wouldn't want to cause trouble because
I dared to propose an idea and claim I can implement it easily.
In the same vein: how do you expect an architect to demonstrate his skills if you ask him to build a skyscraper and give
him clay for sketches?
From: Rochester, New York, USA Insane since: May 2000
posted 05-19-2006 21:31
I would download a copy of the grail code, and I would then take your parsed data and put it into a local test database.
You can then use this for testing and development. Once you have a solid base line you would then look into putting that back into the main source repository with the hope of getting it moved into the asylum baseline during the next code update.
Ack. I am a huge quality freak so this doesn't sound to me like the really real deal, but it is an option.
In the same vein, I have two setups for gfx development, a 20-inch desktop with an Nvidia gpu and a laptop with a Radeon.
Really.
Huge quality freak.
Tyberius Prime
Maniac (V) Mad Scientist with Finglongers
From: Germany Insane since: Sep 2001
posted 05-20-2006 17:57
Look - you'll never get access to the data directly.
You will be called (think webservice) with a single thread, laid out in whatever format you want: xml, comma-separated, plaintext, whatever. Then your AI does its thing, and returns something along the lines of 'archive/reject and suggested Forum'.
If you're wasting your own time because you have not properly separated your html parser from your AI... shame on you.
It is a trivial 5 minute addition to add simple caching to an html parser - much faster, actually, than changing it to a db access layer.
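TP's point that caching is a small addition can be illustrated with a dict keyed by page id. The regex strip here is a crude stand-in for the real parser, just to show the shape of the memoization:

```python
import re

_cache = {}

def strip_html(page_id, html):
    """Parse once per page: cached, so repeated keyword test runs skip the
    slow step. The regex substitution is a toy stand-in for a real parser."""
    if page_id not in _cache:
        _cache[page_id] = re.sub(r"<[^>]+>", "", html)
    return _cache[page_id]

text1 = strip_html("thread-42", "<p>Hello <b>Asylum</b></p>")
text2 = strip_html("thread-42", "<p>ignored: cache hit</p>")  # not re-parsed
```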
In the future, I suggest you keep your posts to one screen page. My life is nearly as busy as yours, though I do have a social life, and I'd rather not read through such amounts of text from you.
Who said he would be the first to help me if this thing comes true?
Oh, and I wouldn't want to affect the social life which seems to make you a tad aggressive already.
Webservices? Hmmm, sounds slow though, considering the engine takes 3-4 seconds to evaluate one page.
Anyway, updates in the Ozone forum...
From: Rochester, New York, USA Insane since: May 2000
posted 05-22-2006 02:10
TP. I am trying to understand what you are offering _Mauro.
Would he have a REST style API that he could access? Does this need to be developed? Last time I looked into the source (about a year ago) this was not in there.
I see it only needing two URLs for inspection, and maybe another URL to send the results to.
http://www.ozoneasylum.com/sink
- returns a list of 100 thread ids [GET]
- accepts a list of ids and a flag which sets (P,D,U) preserve|delete|undecided for each id. [POST]
I would suggest that this information be stored in the bot's local database so that it can be re-examined at any time without needing to hit the asylum.
There would have to be an authentication step based on the inmate's username/password, I would believe.
I would recommend that the URL have some sort of throttle associated with it (the bot running away and clobbering the server would not make anyone happy), and that in initial production the bot's recommendations get put into a holding area that requires manual user authorization when starting this out.
Ultimately the bot would have a lot of work to do in the beginning, as there are a ton (10k?) of threads that need to be examined, but after the initial run the bot might only need to run on a weekly basis to pick up the remaining threads.
I believe providing a webservice for this would be in the best interest of all concerned. Providing just an XML representation would definitely save the asylum bandwidth and would speed things up on _Mauro's end, as parsing a structured XML file is going to be orders of magnitude faster than parsing the asylum's XHTML structure.
If this needs to be implemented let me know via email.
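The two /sink endpoints sketched above could look like this as plain handler functions. These are in-memory stand-ins: only the P|D|U flags and the 100-id page size come from the post; the storage, paging and validation details are assumptions.

```python
# In-memory stand-ins for the two proposed endpoints:
# GET /sink  -> a page of thread ids; POST /sink -> (id, flag) decisions.
THREADS = list(range(1, 251))   # pretend thread ids
DECISIONS = {}                  # bot verdicts awaiting mod review
VALID_FLAGS = {"P", "D", "U"}   # preserve | delete | undecided

def get_sink(offset=0, limit=100):
    """GET handler: return up to 100 thread ids per call."""
    return THREADS[offset:offset + limit]

def post_sink(decisions):
    """POST handler: record (thread_id, flag) pairs, rejecting unknown flags."""
    for thread_id, flag in decisions:
        if flag not in VALID_FLAGS:
            raise ValueError(f"bad flag {flag!r} for thread {thread_id}")
        DECISIONS[thread_id] = flag
    return len(decisions)
```

A throttle and the username/password step would wrap these in a real deployment; they are omitted here to keep the contract visible.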
Tyberius Prime
Maniac (V) Mad Scientist with Finglongers
From: Germany Insane since: Sep 2001
posted 05-22-2006 10:36
No - this is not going to be 'pull by the bot' at all.
The asylum has a regular cron, that will then send a couple dozen or so pages *to* the bot, which will then do its magic.
Then the asylum will use the results of the bot to archive the threads.
The bot is the webservice - the asylum is the client.
Mhh... ok. And the protocol? XML? Would be cool. Formatted text with a given spec would be cool as well.
Shouldn't we use something like IM to coordinate easily?
Anyway, I am now making this move fast: my goal is to break the 80-90% success rate this week by refining the keyword search
and implementing a well balanced set of badkeywords and posters.
There are other technicalities: as it stands, the bot "has" to learn and evolve.
For example, the concept of Ajax was absent from the dhtml forum in past years, so it is not a relevant keyword yet:
it confuses the dhtml threads detection. But in the future, it will have to be added to the knowledge base and used.
However, making it evolve "too fast" is bad, because a single keyword can have a large impact on the detection process.
So, when the first version is up and running, I'll have to set the update system and set it so that the kb gets updated only
with the needed amount of modifications for a given period.
The AI can spot its own shortcomings by spotting its own inconsistencies. If a thread has a high archivability, for example,
but the AI can't decide where to put it, it will need more good keywords and will ask for them.
Tyberius Prime
Maniac (V) Mad Scientist with Finglongers
From: Germany Insane since: Sep 2001
posted 05-22-2006 12:34
SOAP - you define the interface.
Please bear in mind that a bad poster list sounds like specific discrimination and is not acceptable.
I only have good keywords, bad keywords and good posters so far (besides, it's not workable to use "bad posters";
it just doesn't make sense to the ai as it is).
Hmmm... ok, SOAP sounds ok for the service itself, a calling method could be "treatPages($input as string or array)"
and the engine would return a simple list associating threads to archives.
For updating the keywords and posters, and poster ratings per forum though, it's *very* delicate.
As I said, a single info impacts the whole "point of view" of the AI, and it will be this way until the knowledge base gets fairly heavy.
The system should be allowed to send inquiries to mods.
Should it do it using email?
In any event, "replies", new keywords, etc. should not be put immediately in the KB.
I was envisioning this:
1) For posting replies to AI inquiries, Asylum mods should have a logon / restricted access to my server.
2) Should everybody who is a poster have the right to review the KB? I do think so.
3) One mod should not be able to alter information about himself.
4) It should be possible to test proposed changes and see how they impact a bunch of pages.
5) Changes should be heavily limited over time: only a few changes should be accepted for each update.
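Point 5, the per-period cap on changes, could be a simple staged queue: proposals accumulate, and each update period only admits a fixed number of them. All names here are made up for illustration.

```python
from collections import deque

class KnowledgeBase:
    """Staged keyword updates: proposals queue up, and each update period
    admits at most a fixed number of them. A sketch, not the real KB."""

    def __init__(self, max_changes_per_update=3):
        self.keywords = set()
        self.proposals = deque()
        self.max_changes = max_changes_per_update

    def propose(self, keyword):
        self.proposals.append(keyword)

    def apply_update(self):
        """Admit at most max_changes proposals; the rest wait for next period."""
        applied = []
        while self.proposals and len(applied) < self.max_changes:
            kw = self.proposals.popleft()
            self.keywords.add(kw)
            applied.append(kw)
        return applied

kb = KnowledgeBase(max_changes_per_update=2)
for kw in ["ajax", "canvas", "svg"]:
    kb.propose(kw)
first = kb.apply_update()   # "svg" stays queued for the next period
```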
Tyberius Prime
Maniac (V) Mad Scientist with Finglongers
From: Germany Insane since: Sep 2001
posted 05-22-2006 19:15
Tell me, what exactly is the process to 'update' the AI - does it not learn from its own decisions?
It's a so-called second-generation expert system. So, short answer: no.
Because for a tedious task like the Asylum, if the engine makes one wrong assumption, it will then learn more based on that wrong decision and will quickly drift.
So we can't trust the machine to do 100% of the job on its own. Later on, maybe.
When it finds an ambiguity (what an ambiguity is is yet to be determined), it will ask "whoever it should ask" precise questions.
As it stands now, it's hard to tell what an ambiguity is, but it can be several things, for example:
- if a thread can't be archived (is not "relevant to a forum enough").
Then the AI should say something like
"this thread could not be selected for an archive, however, it showed a high relevance to the following archives: <choices>."
"Should the thread be classified as misc? "
If the answer is no, the AI will then ask something like
"Is the thread relevant to one of the proposed archives?"
And if the answer is yes, the AI will ask the mod to add more keywords or posters to the corresponding archive keywords.
What an ambiguity is, as I said, is yet to be determined.
It can be any of the following, for the time being, but later on, only a few cases will remain:
- if a thread is highly archivable, but is not relevant to any topic (example detailed above)
- if a thread is relevant to a given topic by posters, and equally relevant to another topic by keywords (rare case, very rare)
- if a thread is highly relevant to forums, but has a low archivability....
Etc.: all cases that show a lack of knowledge and therefore require a knowledge base enhancement.
Look for "inference engine" and "second generation expert system", that's what my AI is.
getPreferredForum(int thread_id, array posters, array posts, array dates) :: returns (preferred archive), or raises "AmbiguousClassification", which is trapped by requesting more info
// Suggests the best suited archive for this thread
getRelevance(int thread_id, array posters, array posts, array dates) :: returns (array relevance)
// Returns the relevance of this thread to each archive as an associative array
getArchivability(int thread_id, array posters, array posts, array dates) :: returns (float archivability)
// Returns the overall archivability for the thread
Proposed public interface.
Opinions? (is this "enough" or appropriate to your needs?)
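A minimal sketch of the three proposed methods, in Python for illustration. The keyword sets, the scoring scheme and the tie-margin heuristic are all placeholders, not the real knowledge base or inference engine:

```python
class AmbiguousClassification(Exception):
    """Raised when no archive stands out; the caller should request more info."""

# Toy keyword sets per archive: placeholders, not the real knowledge base.
ARCHIVE_KEYWORDS = {
    "dhtml": {"javascript", "dom", "css"},
    "server-side": {"php", "sql", "apache"},
}

def get_relevance(posts):
    """Relevance of a thread to each archive, as an associative array."""
    words = set(" ".join(posts).lower().split())
    return {archive: len(words & kws) / len(kws)
            for archive, kws in ARCHIVE_KEYWORDS.items()}

def get_archivability(posts):
    """Overall archivability: here, simply the best relevance score."""
    return max(get_relevance(posts).values())

def get_preferred_forum(posts, margin=0.1):
    """Best-suited archive, or AmbiguousClassification when the top two
    scores are too close to call (the margin value is an arbitrary choice)."""
    ranked = sorted(get_relevance(posts).items(), key=lambda kv: -kv[1])
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        raise AmbiguousClassification(ranked)
    return ranked[0][0]
```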
The service should be up and running by tomorrow, probably based on the nusoap system.
An update will be made to the code to make it take posters into account.
I "could" port my code to C++ and a static library, to boost performance.
Tyberius Prime
Maniac (V) Mad Scientist with Finglongers
From: Germany Insane since: Sep 2001
posted 05-24-2006 14:06
Two words: One Call.
In fact, return a tuple of (archivability, sortedListOfArchives), no floats in the archivability (map it to a boolean - caring whether .87 is good enough is your AI's problem).
I'm pretty busy these days. But I'll probably get you a couple thousand calls sometime early next week.
Ok, one call and an assoc array as the return value - for the Asylum-beyondwo bridge.
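The single-call contract TP asks for could wrap the scores like this; the 0.5 threshold is an arbitrary stand-in for whatever the AI decides "good enough" means:

```python
ARCHIVABILITY_THRESHOLD = 0.5  # arbitrary: deciding whether .87 is "good
                               # enough" is the AI's problem, per TP

def treat_thread(scores, threshold=ARCHIVABILITY_THRESHOLD):
    """One call: map the top float to a boolean internally and return
    (archivable, archives sorted best-first)."""
    ranked = sorted(scores, key=lambda a: -scores[a])
    archivable = scores[ranked[0]] >= threshold
    return archivable, ranked

result = treat_thread({"dhtml": 0.87, "server-side": 0.12, "ozone": 0.40})
```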
But bear in mind that the engine will need to evolve and will need tips from mods, so aside from the main remote procedure,
I should probably put together several public methods that will help streamline the knowledge base enhancements.
(Mods receive a request for information from the bot, they answer the template email after having filled in some fields, and the automod automatically
uses the received input appropriately...)
Hullo. I am a tad late (well, clearly late) and I just wanted to tell you why, a matter of politeness.
The results I obtained so far sound promising; automod seems to work a treat, and only gets better with each iteration of the project.
I want it to work ideally, and as I am not pressed for time, I don't want to rush tiny updates for the sake of shouting "update".
I have been busy in real world, the year is ending, and the final projects are being closed and delivered (3 applications and a couple of tests remain).
And I am between two consulting missions, meeting the new customer, for a new, harder, better paid, more interesting challenge, tomorrow.
...That's how the life of the consultant goes: it moves at a tremendous speed.
Within the next week, the four top priorities on my list are new mission, lisp app, networking test, and automod.
This means that the SOAP interface will pop up "sometime during the next week" and that I can't be more precise: I will just catch the tiny free-time frame which will appear over this period to
polish that piece of code.
Mad libs contacted, with the exception of vvrose, for lack of an email address; will catch her on Q asap.
Deadline set by school: automod *must* be ready this month, at least my part.
SOAP interface still not public; nusoap and my server config tend to clash on odd things for now, and it's not the most important priority, methinks (the most important is reaching 90% of threads identified,
hence the emphasis on keywords).
And a legal statement, written, public, which will sound silly but is necessary:
- in any event, and in the limits of the accessibility of domains owned by Mauro Colella, automod will remain a free service for the Asylum, and related parties, for 10 years starting from now
(enough time for the technology behind it to disappear or turn into something completely different, and for it to need a partial or complete re-design).
Tyberius Prime
Maniac (V) Mad Scientist with Finglongers
From: Germany Insane since: Sep 2001
posted 06-15-2006 13:20
Ok, here's my three step plan for integrating your Automod
1. Tie it into the ->smartSink
2. Watch how it works out, refine together with you if the accuracy is too low.
3. Automate the process.
Sounds fine, except for point 3: the automod, needing to evolve over time, must keep the possibility to ask mad libs every now and then if it is doing things
right. This need will largely decrease as it is trained and the kb enhanced, but it's better in the long run if it remains to an extent.
Anyway, while mad libs are working on the kb, I'll try to lay out what we call a "report", private and confidential for me, TP and mad libs, which shows all the technical whereabouts
of it: speaking about UML... (plus I have to make such a doc anyway for any program I release)
@kuckus and velvetrose, where, and how, can I reach you about this thing? Wasn't able to get a hold of you so far, using your forum profile info.
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Okay... so, real life did really cool things to me, but got unexpectedly demanding.
Long story made short: I am "moving upwards" in the consulting hierarchy towards being a senior consultant, and am actively involved in big projects.
For instance, a high-end project about archeology and 3d, in collaboration with several universities.
Anyway, classes are almost over: two things left to deliver, a presentation and the final "Xiego / go game framework"
And... automod.
I *must* deliver the application plan and project draft today, before midnight.
Which will allow me to formalize it, and whichever details were left "vague": diagrams, formalized rules, a thorough analysis which I had only roughly sketched so far.
Next week, I'll commit to making it true and active, with a huge delay compared to my early expectations:
making a fuzzy logic analyser that recognizes almost every -mad- discussion at the Asylum was a first for me, and turned out to be a lot bigger than I expected.
I'll host the project draft / diagrams / plans publicly, unless TP or others want me to keep it discreet, for the sake of showing you
how big it got, and how many rethinkings got in the way. But it's at hand, and is the winner I wanted.
Haven't decided yet.
Honestly, I am all for keeping them closed, but publishing the technical report.
It would give enough information to have someone knowledgeable reimplement it using any s-side language (i.e. the thinking behind it).
I certainly will keep hosting the service for free for this, and other interesting forums, for a lifetime if I can afford it (which shouldn't be a problem at all - once it's up and running,
the evolution can be handed to forum mods).
But I see ways in which the code could evolve toward real business applications, so as honestly as I can: probably won't release the source.
Which makes me think there is no problem for me to release the source of the "genetic sudoku" solver if you care... off to the dhtml forums.
Ok, mega-hectic schedule of sorts: I succeeded at my nightly exams, and am doing more things than ever
before professionally, moving at a tremendous speed.
Just got the mark for the automod project, the report posted right above: 5.3/6, and it's ambiguous whether it is
an ES or a Wizard, but it works well, hence the mark.
And the plan is right, but the plan doesn't only document the bot's job: it documents the way mods should use the bot as well.
Now that I am certain I got both right, I came back here today for this single piece of info; the project continues.
I'll shortly be able to update my binaries section to demonstrate all of the projects I managed from A to Z lately,
and automod will get its SOAP.