From john at repetae.net Wed Feb 13 18:50:23 2008
From: john at repetae.net (John Meacham)
Date: Wed Feb 13 18:48:56 2008
Subject: [jhc] Re: Ids in Jhc.
In-Reply-To:
References:
Message-ID: <20080213235022.GB9671@sliver.repetae.net>

(I'm gonna forward this response to the jhc list too, if that's okay)

On Thu, Feb 14, 2008 at 12:26:49AM +0100, Lemmih wrote:
> I've been wondering about your use of Ids in Jhc.
> As far as I can see, the rules for Ids go as follows:
>  * named ids are odd
>  * unnamed ids are even
>  * id '0' has special significance.
>
> Can you tell me a bit about how unnamed ids are used, why id '0' has
> special meaning, and how you generate new, unique unnamed ids.

Yes. that is correct.

named and unnamed ids behave the same way, except named ids get a pretty
name when printed out as opposed to just a number and are generally
derived from a user supplied name. Also, all 'top level' names are
always 'named' to ensure they remain globally unique.

named Ids are odd, these are looked up in a table that is implemented
via a hash table written in C in the StringTable/ directory. due to an
intentional quirk of its implementation, it only generates odd, positive
values. Conversion between a number and its string form is constant
time in both directions.

zero is a 'special' id in that it can be treated normally except it can
never appear in an expression, only in a binder. so \ v0 -> ... is okay,
but \ x -> .. v0 ... is not. This means you can always use 0 as a binder
when you know a value is not used without worrying about shadowing a
real definition.

unnamed ids are arbitrary positive even numbers, there is no particular
need to allocate them in a special way, they are _not_ globally unique
so you just need to ensure you don't shadow any existing variables when
assigning new ones. in general, this consists of having a set of 'in
scope' names (which is often something you need to keep track of anyway)
and you choose some value that is not in the 'in scope' set and add it
to the set. this ensures ids don't shadow each other, but not that they
are unique.

negative ids are used inside of a few specialized routines internally
and should never make it out into the world. for instance, in the
unification algorithm they are used to represent alpha-renamed values so
we don't need to scan for used names beforehand, however the terms
returned have their original ids restored.

hope this helps...

It is quite bothersome to me that 'Id' is just an Int and not a
newtype... that is something I have been slowly getting the code ready
to fix for a while now..

        John

-- 
John Meacham - ⑆repetae.net⑆john⑈

From lemmih at gmail.com Wed Feb 13 19:23:08 2008
From: lemmih at gmail.com (Lemmih)
Date: Wed Feb 13 19:40:59 2008
Subject: [jhc] Re: Ids in Jhc.
In-Reply-To: <20080213235022.GB9671@sliver.repetae.net>
References: <20080213235022.GB9671@sliver.repetae.net>
Message-ID:

On Feb 14, 2008 12:50 AM, John Meacham wrote:
[snip]
> It is quite bothersome to me that 'Id' is just an Int and not a
> newtype... that is something I have been slowly getting the code ready
> to fix for a while now..

I'd like to fix it in another way, if you have no objections.

By my count there are 120 unique atoms in base-1.0.hl. Those 120 atoms
are saved ~2344 times each (there's a total of 281338 atoms).
Compressing the files with gzip reduces the negative effect on file
size but the overhead of parsing 281k atoms is still there.
On my system, testing with HelloWorld.hs, parsing those atoms took 16%
of the runtime and 27% of the memory usage.

I propose scrapping the current system of random atom ids in favor of
a per-module map of symbols.
Using an ADT for the ids would make the system more transparent. I'll
have to investigate the performance impact of this.

-- 
Cheers,
  Lemmih

From lemmih at gmail.com Wed Feb 13 19:29:43 2008
From: lemmih at gmail.com (Lemmih)
Date: Wed Feb 13 19:44:29 2008
Subject: [jhc] Re: Ids in Jhc.
In-Reply-To: <20080213235022.GB9671@sliver.repetae.net>
References: <20080213235022.GB9671@sliver.repetae.net>
Message-ID:

On Feb 14, 2008 12:50 AM, John Meacham wrote:
> (I'm gonna forward this response to the jhc list too, if that's okay)
> On Thu, Feb 14, 2008 at 12:26:49AM +0100, Lemmih wrote:
> > I've been wondering about your use of Ids in Jhc.
> > As far as I can see, the rules for Ids go as follows:
> >  * named ids are odd
> >  * unnamed ids are even
> >  * id '0' has special significance.
> >
> > Can you tell me a bit about how unnamed ids are used, why id '0' has
> > special meaning, and how you generate new, unique unnamed ids.

[snip]

> named Ids are odd, these are looked up in a table that is implemented
> via a hash table written in C in the StringTable/ directory. due to an
> intentional quirk of its implementation, it only generates odd, positive
> values. Conversion between a number and its string form is constant
> time in both directions.

It looks like it has been replaced by a Haskell implementation.

-- 
Cheers,
  Lemmih

From lemmih at gmail.com Wed Feb 13 19:39:37 2008
From: lemmih at gmail.com (Lemmih)
Date: Wed Feb 13 19:44:29 2008
Subject: [jhc] Re: Ids in Jhc.
In-Reply-To:
References: <20080213235022.GB9671@sliver.repetae.net>
Message-ID:

On Feb 14, 2008 1:23 AM, Lemmih wrote:
[snip]
> By my count there are 120 unique atoms in base-1.0.hl. Those 120 atoms
> are saved ~2344 times each (there's a total of 281338 atoms).

Oh, I should mention that I'm counting TVr's. Those are apparently
mostly module names.

-- 
Cheers,
  Lemmih

From lemmih at gmail.com Wed Feb 13 19:49:04 2008
From: lemmih at gmail.com (Lemmih)
Date: Wed Feb 13 19:47:39 2008
Subject: [jhc] Re: Ids in Jhc.
In-Reply-To:
References: <20080213235022.GB9671@sliver.repetae.net>
Message-ID:

On Feb 14, 2008 1:39 AM, Lemmih wrote:
[snip]
> Oh, I should mention that I'm counting TVr's. Those are apparently
> mostly module names.

A count of atoms that are forced shows a different result: 258155
saved atoms, 7390 unique atoms. Not as bad as the TVr count but still
not good.

-- 
Cheers,
  Lemmih

From lemmih at gmail.com Wed Feb 13 19:55:43 2008
From: lemmih at gmail.com (Lemmih)
Date: Wed Feb 13 19:54:19 2008
Subject: [jhc] Re: Ids in Jhc.
In-Reply-To:
References: <20080213235022.GB9671@sliver.repetae.net>
Message-ID:

On Feb 14, 2008 1:49 AM, Lemmih wrote:
[snip]
> A count of atoms that are forced shows a different result: 258155
> saved atoms, 7390 unique atoms. Not as bad as the TVr count but still
> not good.

The worst case scenario is: 990666 saved atoms, 9980 unique atoms.

-- 
Cheers,
  Lemmih

From john at repetae.net Wed Feb 13 22:27:06 2008
From: john at repetae.net (John Meacham)
Date: Wed Feb 13 22:25:39 2008
Subject: [jhc] Re: Ids in Jhc.
In-Reply-To:
References: <20080213235022.GB9671@sliver.repetae.net>
Message-ID: <20080214032706.GC9671@sliver.repetae.net>

On Thu, Feb 14, 2008 at 01:29:43AM +0100, Lemmih wrote:
> On Feb 14, 2008 12:50 AM, John Meacham wrote:
> > (I'm gonna forward this response to the jhc list too, if that's okay)
> > On Thu, Feb 14, 2008 at 12:26:49AM +0100, Lemmih wrote:
> > > I've been wondering about your use of Ids in Jhc.
> > > As far as I can see, the rules for Ids go as follows:
> > >  * named ids are odd
> > >  * unnamed ids are even
> > >  * id '0' has special significance.
> > >
> > > Can you tell me a bit about how unnamed ids are used, why id '0' has
> > > special meaning, and how you generate new, unique unnamed ids.
>
> [snip]
>
> > named Ids are odd, these are looked up in a table that is implemented
> > via a hash table written in C in the StringTable/ directory. due to an
> > intentional quirk of its implementation, it only generates odd, positive
> > values.
> > Conversion between a number and its string form is constant
> > time in both directions.
>
> It looks like it has been replaced by a Haskell implementation.

No, I just pushed out my changes that actually redid the atom stuff. it
is now down to about a fifth of what it used to be.

        John

-- 
John Meacham - ⑆repetae.net⑆john⑈

From john at repetae.net Wed Feb 13 22:36:43 2008
From: john at repetae.net (John Meacham)
Date: Wed Feb 13 22:35:15 2008
Subject: [jhc] Re: Ids in Jhc.
In-Reply-To:
References: <20080213235022.GB9671@sliver.repetae.net>
Message-ID: <20080214033643.GD9671@sliver.repetae.net>

On Thu, Feb 14, 2008 at 01:23:08AM +0100, Lemmih wrote:
> I'd like to fix it in another way, if you have no objections.
>
> By my count there are 120 unique atoms in base-1.0.hl. Those 120 atoms
> are saved ~2344 times each (there's a total of 281338 atoms).
> Compressing the files with gzip reduces the negative effect on file
> size but the overhead of parsing 281k atoms is still there.
> On my system, testing with HelloWorld.hs, parsing those atoms took 16%
> of the runtime and 27% of the memory usage.
>
> I propose scrapping the current system of random atom ids in favor of
> a per-module map of symbols.
> Using an ADT for the ids would make the system more transparent. I'll
> have to investigate the performance impact of this.

What do you mean? just for saving binary files? it already converts them
to an ADT on the way out and back in since atom numbers can be different
between different runs of the program.

If you mean using an ADT internally for ids,
stopping doing that is what brought memory usage from gigabytes to
megabytes. The text of the atoms only needs to be stored in a single
place, also the fact they are Ints means they can be unboxed in
datatypes, meaning the garbage collector does not even have to care
about them, another huge gain.
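For readers following the thread, the id conventions described so far (zero only as a binder, odd for named, positive even for unnamed, negative for internal use) can be sketched in a few lines of Haskell. This is only an illustration under stated assumptions, not jhc's code: `IdKind`, `classify` and `freshUnnamed` are hypothetical names, the `newtype Id` wrapper is the one wished for earlier in the thread rather than what jhc currently has (`type Id = Int`), and starting the search from the size of the in-scope set is just one plausible way to pick a candidate.

```haskell
import qualified Data.IntSet as IS

-- Hypothetical wrapper; at the time of this thread jhc uses `type Id = Int`.
newtype Id = Id Int
  deriving (Eq, Ord, Show)

-- The four classes of id described in the thread (names are illustrative).
data IdKind = Placeholder | Named | Unnamed | Internal
  deriving (Eq, Show)

classify :: Id -> IdKind
classify (Id n)
  | n == 0    = Placeholder  -- may appear in a binder, never in an expression
  | n < 0     = Internal     -- used only inside a few specialized routines
  | odd n     = Named        -- index into the atom table
  | otherwise = Unnamed      -- arbitrary positive even number

-- Choose an unnamed id that is not in scope and add it to the set.
-- The result does not shadow anything in scope, but it is not globally unique.
freshUnnamed :: IS.IntSet -> (Id, IS.IntSet)
freshUnnamed inScope = go start
  where
    -- Assumption: seed from the set size, then probe even numbers upward.
    start = 2 * (IS.size inScope + 1)
    go n
      | n `IS.member` inScope = go (n + 2)
      | otherwise             = (Id n, IS.insert n inScope)
```

Because the in-scope set is finite, the probe always terminates, and because it starts at a positive even number and steps by two it can only ever return an unnamed id.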
Also, IntSet/IntMap are orders of magnitude faster than Set and Map and
sets/maps of ids are the bed and butter of optimization passes.

my only issue with them is that it is

type Id = Int
instead of
newtype Id = Id Int

I have been working on the binary format with my recent changes, it is
now much less of an issue than it used to be in terms of speed and
memory usage. I modified it to use bytestrings and split the file into
various 'chunks' that can be lazily loaded independently. so when it
only needs the dependency information that is all it needs to pull out.
and so forth. I also put a hard limit on the length of atoms to 256
bytes, which means I can chop 7 bytes off the storage of each one.

By far the biggest win was in Info/Info, which used to store the type of
each field alongside of it as a string, now it stores a hash of the type
as a number, this reduced the file size (after compression) by about 30%.

        John

-- 
John Meacham - ⑆repetae.net⑆john⑈

From naesten at gmail.com Thu Feb 14 00:01:57 2008
From: naesten at gmail.com (Samuel Bronson)
Date: Thu Feb 14 00:00:31 2008
Subject: [jhc] Re: Ids in Jhc.
In-Reply-To: <20080214033643.GD9671@sliver.repetae.net>
References: <20080213235022.GB9671@sliver.repetae.net> <20080214033643.GD9671@sliver.repetae.net>
Message-ID:

On Feb 13, 2008 10:36 PM, John Meacham wrote:
> Also, IntSet/IntMap are orders of magnitude faster than Set and Map and
> sets/maps of ids are the bed and butter of optimization passes.

Do optimization passes enjoy breakfast in bed?

From lemmih at gmail.com Thu Feb 14 05:08:08 2008
From: lemmih at gmail.com (Lemmih)
Date: Thu Feb 14 05:06:41 2008
Subject: [jhc] Re: Ids in Jhc.
In-Reply-To: <20080214033643.GD9671@sliver.repetae.net>
References: <20080213235022.GB9671@sliver.repetae.net> <20080214033643.GD9671@sliver.repetae.net>
Message-ID:

On Thu, Feb 14, 2008 at 4:36 AM, John Meacham wrote:
> On Thu, Feb 14, 2008 at 01:23:08AM +0100, Lemmih wrote:
> > I'd like to fix it in another way, if you have no objections.
> >
> > By my count there are 120 unique atoms in base-1.0.hl. Those 120 atoms
> > are saved ~2344 times each (there's a total of 281338 atoms).
> > Compressing the files with gzip reduces the negative effect on file
> > size but the overhead of parsing 281k atoms is still there.
> > On my system, testing with HelloWorld.hs, parsing those atoms took 16%
> > of the runtime and 27% of the memory usage.
> >
> > I propose scrapping the current system of random atom ids in favor of
> > a per-module map of symbols.
> > Using an ADT for the ids would make the system more transparent. I'll
> > have to investigate the performance impact of this.
>
> What do you mean? just for saving binary files? it already converts them
> to an ADT on the way out and back in since atom numbers can be different
> between different runs of the program.

I mean something like: data Id = Phantom | Unnamed Int | Named Atom
Giving special meaning to numbers seems like a hack. Optimizations
should not come at the sacrifice of readability.
Indexing atoms by a random number is also something that can and
should be avoided.

> If you mean using an ADT internally for ids,
> stopping doing that is what brought memory usage from gigabytes to
> megabytes. The text of the atoms only needs to be stored in a single
> place, also the fact they are Ints means they can be unboxed in
> datatypes, meaning the garbage collector does not even have to care
> about them, another huge gain.

Only keeping a single instance of each unique string in memory is
obviously the way to go. I'm not arguing against that.
However, given that there are only a total of ~9000 unique strings,
unboxing will most likely mean nothing.

> Also, IntSet/IntMap are orders of magnitude faster than Set and Map and
> sets/maps of ids are the bed and butter of optimization passes.

IntSet and IntMap are truly very fast. However, how that maps into CPU
time and memory usage is not known. Readability may have been
sacrificed for no more than a few percent improvement.

> my only issue with them is that it is
>
> type Id = Int
> instead of
> newtype Id = Id Int
>
> I have been working on the binary format with my recent changes, it is
> now much less of an issue than it used to be in terms of speed and
> memory usage. I modified it to use bytestrings and split the file into
> various 'chunks' that can be lazily loaded independently. so when it
> only needs the dependency information that is all it needs to pull out.
> and so forth. I also put a hard limit on the length of atoms to 256
> bytes, which means I can chop 7 bytes off the storage of each one.

Shaving 7 bytes off seems weird when each atom is stored 100 times on
disk. Also, using a single byte to store length seems an awful lot like
only storing the last two digits in the year.
I'm more interested in algorithm optimizations: optimizations that
make the code easier to use and read, and thereby avoid silly things
like storing duplicate atoms on disk.

> By far the biggest win was in Info/Info, which used to store the type of
> each field alongside of it as a string, now it stores a hash of the type
> as a number, this reduced the file size (after compression) by about 30%.

Info seems a bit scary to me. It is not garbage collected, so it is up
to the user to ensure that references are released as early as
possible.

Btw, I'm trying to offer constructive criticism. I'm interested in
actually writing code, not just pointing out weak spots.

-- 
Cheers,
  Lemmih

From lemmih at gmail.com Thu Feb 14 06:52:15 2008
From: lemmih at gmail.com (Lemmih)
Date: Thu Feb 14 06:50:48 2008
Subject: [jhc] Darcs patch.
Message-ID:

The latest changes broke the makefile rule for jhcp.

-- 
Cheers,
  Lemmih
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jhcpfix.dpatch
Type: application/octet-stream
Size: 355 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/jhc/attachments/20080214/ff652801/jhcpfix.obj

From naesten at gmail.com Thu Feb 14 14:01:14 2008
From: naesten at gmail.com (Samuel Bronson)
Date: Thu Feb 14 13:59:46 2008
Subject: [jhc] Re: Ids in Jhc.
In-Reply-To:
References: <20080213235022.GB9671@sliver.repetae.net> <20080214033643.GD9671@sliver.repetae.net>
Message-ID:

On 2/14/08, Lemmih wrote:

> I mean something like: data Id = Phantom | Unnamed Int | Named Atom
> Giving special meaning to numbers seems like a hack. Optimizations
> should not come at the sacrifice of readability.

It's not a hack if you use a newtype... at least, not observably. Yhc
didn't, and it did do some really nasty stuff. (Possibly nhc predated
newtypes?)

From lemmih at gmail.com Thu Feb 14 16:10:54 2008
From: lemmih at gmail.com (Lemmih)
Date: Thu Feb 14 16:09:25 2008
Subject: [jhc] Re: Ids in Jhc.
In-Reply-To:
References: <20080213235022.GB9671@sliver.repetae.net> <20080214033643.GD9671@sliver.repetae.net>
Message-ID:

On Thu, Feb 14, 2008 at 8:01 PM, Samuel Bronson wrote:
> On 2/14/08, Lemmih wrote:
>
> > I mean something like: data Id = Phantom | Unnamed Int | Named Atom
> > Giving special meaning to numbers seems like a hack. Optimizations
> > should not come at the sacrifice of readability.
>
> It's not a hack if you use a newtype... at least, not observably. Yhc
> didn't, and it did do some really nasty stuff. (Possibly nhc predated
> newtypes?)

If you hide the implementation then you might as well provide the user
with an ADT instead of an Int.

The optimizations in Jhc seem to be geared towards making things more
"basic". Data types are unrolled so they can fit in an integer,
Haskell code is replaced by C code. These optimizations give the
illusion of improvements at the cost of readability. Rewriting a piece
of Haskell code in C is rather silly when it is the algorithm that's
broken.

-- 
Cheers,
  Lemmih

From lemmih at gmail.com Fri Feb 15 09:34:59 2008
From: lemmih at gmail.com (Lemmih)
Date: Fri Feb 15 09:33:28 2008
Subject: [jhc] Optimization.
Message-ID:

Greetings,

The attached patch is a two-step optimization:
 1) Only call 'getType' 100,000,000 times instead of 600,000,000 times.
 2) Greatly simplify the actual loop.

Without this patch, compiling base-1.0.hl takes 19 minutes. With this
patch, compile time is down to 11 minutes.
Feedback would be greatly appreciated.

-- 
Cheers,
  Lemmih
-------------- next part --------------
A non-text attachment was scrubbed...
Name: optimizeTc.dpatch
Type: application/octet-stream
Size: 1725 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/jhc/attachments/20080215/43e3a2de/optimizeTc.obj

From lemmih at gmail.com Fri Feb 15 13:21:55 2008
From: lemmih at gmail.com (Lemmih)
Date: Fri Feb 15 13:20:27 2008
Subject: [jhc] Hotspots.
Message-ID:

Greetings,

I've found a few hotspots that I'll be working on. I'd be very
interested in discussing solutions.

Performance flaws:
 * IdMaps are used to generate new ids.
 * Ho files contain huge amounts of duplicate information.
 * Ho files aren't saved lazily.
 * C code is used for generating atoms.

Repeatedly mapping variables to 'const Nothing' is very expensive. It
is currently the most expensive procedure in Jhc, taking ~20% CPU time
when compiling the base library.

The base library contains 65 megabytes of uncompressed data. Most of
that is duplicate information that disappears when it is compressed.
However, parsing that amount of data takes considerable time.
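
One way to picture removing that duplication is the per-module symbol map proposed earlier in the thread: write each distinct atom once, and let every occurrence refer to it by a small index. The following is a minimal sketch of that idea only, with no claim about the actual Ho file layout. `internAll` and `resolve` are hypothetical names, and atoms are modelled as plain `String`s for simplicity.

```haskell
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- Split a stream of atoms into a table of unique atoms (in order of first
-- appearance) and one table index per occurrence. Only the table carries
-- the text; the occurrences become small integers.
internAll :: [String] -> ([String], [Int])
internAll atoms = (reverse table, reverse refs)
  where
    (_, table, refs) = foldl' step (M.empty, [], []) atoms
    step (seen, tbl, out) a =
      case M.lookup a seen of
        Just i  -> (seen, tbl, i : out)            -- already in the table
        Nothing ->
          let i = M.size seen                      -- next free index
          in (M.insert a i seen, a : tbl, i : out)

-- Inverse direction: turn indices back into atoms when reading.
-- List indexing is linear; a real reader would use an array.
resolve :: [String] -> [Int] -> [String]
resolve table = map (table !!)
```

For example, `internAll ["Prelude", "Data.Maybe", "Prelude"]` gives the table `["Prelude", "Data.Maybe"]` and the references `[0, 1, 0]`, and `resolve` maps those back to the original list.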
Ho files are completely serialized before they're written to disk. Using C code for generating atoms results in no performance improvement. There are only a few cases where using C is beneficial and this is not one of them. -- Cheers, Lemmih From john at repetae.net Fri Feb 15 17:46:11 2008 From: john at repetae.net (John Meacham) Date: Fri Feb 15 17:44:39 2008 Subject: [jhc] Optimization. In-Reply-To: References: Message-ID: <20080215224611.GA14415@sliver.repetae.net> On Fri, Feb 15, 2008 at 03:34:59PM +0100, Lemmih wrote: > Greetings, > > The attached patch is a two-step optimization: > 1) Only call 'getType' 100,000,000 times instead of 600,000,000 times. > 2) Greatly simplify the actual loop. > > Without this patch, compiling base-1.0.hl takes 19 minutes. With this > patch, compile time is down to 11 minutes. > Feedback would be greatly appreciated. thanks, Yeah, there is some questionable algorithmic stuff in the typechecker/optimizer. in particular I know System.Time hits some exponential behavior (I believe dealing with the large datatpe and its derived instances) I am not including the Maybe patch as I have a different one that reorganizes more of the libraries to help reduce the 'Prelude glut', the big mass of mutually recursive modules that include prelude. PS. could you send patches via 'darcs send'? that way my scripts will automatically be able to apply them. John -- John Meacham - ?repetae.net?john? From john at repetae.net Fri Feb 15 18:01:01 2008 From: john at repetae.net (John Meacham) Date: Fri Feb 15 17:59:29 2008 Subject: [jhc] Hotspots. In-Reply-To: References: Message-ID: <20080215230101.GB14415@sliver.repetae.net> On Fri, Feb 15, 2008 at 07:21:55PM +0100, Lemmih wrote: > Greetings, > > I've found a few hotspots that'll be working on. I'd be very > interested in discussing solutions. > > Performance flaws: > * IdMaps are used to generate new ids. > * Ho files contain huge amounts of duplicate information. > * Ho files aren't saved lazily. 
> * C code is used for generating atoms. > > Repeatedly mapping variables to 'const Nothing' is very expensive. It > is currently the most expensive procedure in Jhc, taking ~20% CPU time > when compiling the base library. Hmm.. yeah, sometimes I used Maps, sometimes Sets, depending on what I already have and sometimes it helped to switch between them, sometimes not. it wasn't always obvious. Ideally all new id selection will be done in Name.Id as part of the general plan to turn Id into a newtype. It has the newIds routine, a couple variants of that to work on IdMap and IdSet would be good. if I am just doing a map (const Nothing) before passing to the id selection routine then that can probably just be dropped since the id selection stuff doesn't care about the actual values in the map. The id selection can be finicky, using Set.size to seed the iterations helped a bunch but I wanted to try a hash function from Id -> Id at some point as it should reduce the time spent linearly probing for an open Id. > The base library contains 65 megabytes of uncompressed data. Most of > that is duplicate information that disappears when it is compressed. > However, parsing that amount of data takes considerable time. Which duplicate data in particular is concerning you? I am in the process of completely reorganizing the Ho file layout so it is probably best to hold off here. Some of the redundancy is there on purpose, but most probably isn't. > Ho files are completely serialized before they're written to disk. Yeah, the issue with fully lazy writing is we need the length of the chunk before we can produce the bytestring, however, it would be straightforward to write a routine that counted the data as it writes it out to disk then seeks back to fill in the length field. It would mean passing in a filehandle rather than getting out a lazy bytestring... but that is an acceptable bit of ugliness. > Using C code for generating atoms results in no performance > improvement.
There are only a few cases where using C is beneficial > and this is not one of them. The C code here was mainly about reducing GC pressure. in any case, I'd like to leave it in, I am not sure I have explored all the uses of that cuckoo hash and it is relatively new code. And C is one of my native languages. :) John -- John Meacham - ?repetae.net?john? From john at repetae.net Fri Feb 15 18:08:18 2008 From: john at repetae.net (John Meacham) Date: Fri Feb 15 18:06:45 2008 Subject: [jhc] Re: Ids in Jhc. In-Reply-To: References: <20080213235022.GB9671@sliver.repetae.net> <20080214033643.GD9671@sliver.repetae.net> Message-ID: <20080215230818.GC14415@sliver.repetae.net> On Thu, Feb 14, 2008 at 10:10:54PM +0100, Lemmih wrote: > On Thu, Feb 14, 2008 at 8:01 PM, Samuel Bronson wrote: > > On 2/14/08, Lemmih wrote: > > > > > I mean something like: data Id = Phantom | Unnamed Int | Named Atom > > > Giving special meaning to numbers seems like a hack. Optimizations > > > should not come at the sacrifice of readability. > > > > It's not a hack if you use a newtype... at least, not observably. Yhc > > didn't, and it did do some really nasty stuff. (Possibly nhc predated > > newtypes?) > > If you hide the implementation then you might as well provide the user > with an ADT instead of an Int. The newtype Int _is_ an ADT. as in, all that will be exported from Name.Id is Id(). Haskell's ability for abstraction is one of its great qualities. :) Exposing it abstractly would be something I'd want to do no matter how it was implemented, The internal representation doesn't matter as long as the API is clean and APIs are what I care about. > The optimizations in Jhc seems to be geared towards making things more > "basic". Data types are unrolled so they can fit in an integer, > Haskell code is replaced by C code. These optimizations give the > illusion of improvements at the cost of readability. 
> Rewriting a piece of Haskell code in C is rather silly when it is the > algorithm that's broken. It is about providing a strong foundation with good abstract APIs. Hopefully it will get to the point where these little things do make a difference. I don't find it hurting readability really. or if it does, that is more of an issue of documentation. like having to look inside how ids are represented means that I didn't document how they behave externally enough rather than that the representation needs to change. John -- John Meacham - ?repetae.net?john? From john at repetae.net Fri Feb 15 18:15:17 2008 From: john at repetae.net (John Meacham) Date: Fri Feb 15 18:13:45 2008 Subject: [jhc] Optimization. In-Reply-To: References: Message-ID: <20080215231517.GD14415@sliver.repetae.net> Actually what I really need to do to keep issues like that Maybe thing from coming up is allow derived instances to happen at a place other than where the type is declared. pulling in 'Read' to declare Bool just doesn't work that well. and I don't like all these manually written instances cluttering up the libraries.. perhaps a pragma.. like {-# DERIVE: Enum Bool #-}. or putting "deriving 'Read'" will just add a placeholder that will be expanded to a real derivation when the read class comes into scope. though, that is sort of hacky as I would have to fudge namespace resolution in deriving clauses. do other compilers do something clever here? it looks like ghc does what I do but with CPP tricks and basically writes out its own instances in full. John -- John Meacham - ?repetae.net?john? From john at repetae.net Fri Feb 15 18:45:02 2008 From: john at repetae.net (John Meacham) Date: Fri Feb 15 18:43:29 2008 Subject: [jhc] Optimization. In-Reply-To: References: Message-ID: <20080215234502.GE14415@sliver.repetae.net> On Fri, Feb 15, 2008 at 03:34:59PM +0100, Lemmih wrote: > The attached patch is a two-step optimization: > 1) Only call 'getType' 100,000,000 times instead of 600,000,000 times.
> 2) Greatly simplify the actual loop. > > Without this patch, compiling base-1.0.hl takes 19 minutes. With this > patch, compile time is down to 11 minutes. > Feedback would be greatly appreciated. I am confused, what in the patch causes getType to be called less? As far as I can tell it is just replacing equality checks with 'isFoo' predicates. John -- John Meacham - ?repetae.net?john? From lemmih at gmail.com Fri Feb 15 18:52:30 2008 From: lemmih at gmail.com (Lemmih) Date: Fri Feb 15 18:51:00 2008 Subject: [jhc] Optimization. In-Reply-To: <20080215234502.GE14415@sliver.repetae.net> References: <20080215234502.GE14415@sliver.repetae.net> Message-ID: On Sat, Feb 16, 2008 at 12:45 AM, John Meacham wrote: > On Fri, Feb 15, 2008 at 03:34:59PM +0100, Lemmih wrote: > > > > The attached patch is a two-step optimization: > > 1) Only call 'getType' 100,000,000 times instead of 600,000,000 times. > > 2) Greatly simplify the actual loop. > > > > Without this patch, compiling base-1.0.hl takes 19 minutes. With this > > patch, compile time is down to 11 minutes. > > Feedback would be greatly appreciated. > > I am confused, what in the patch causes getType to be called less? As > far as I can tell it is just replacing equality checks with 'isFoo' > predicates. E.Values.isCheap checks whether its argument is atomic. The 'isAtomic' function is very expensive so I moved it down below the four static checks. -- Cheers, Lemmih From isaacdupree at charter.net Fri Feb 15 19:23:25 2008 From: isaacdupree at charter.net (Isaac Dupree) Date: Fri Feb 15 19:21:50 2008 Subject: [jhc] Optimization. In-Reply-To: <20080215231517.GD14415@sliver.repetae.net> References: <20080215231517.GD14415@sliver.repetae.net> Message-ID: <47B62CFD.3050001@charter.net> John Meacham wrote: > Actually what I really need to do to make issues like that maybe thing > from coming up is allow derived instances to happen at a place other > than where the type is delared. 
pulling in 'Read' to declare Bool just > doesn't work that well. and I don't like all these manually written > instances cluttering up the libraries.. > > perhaps a pragma.. like {-# DERIVE: Enum Bool #-}. or putting "deriving > 'Read'" will just add a placeholder that will be expanded to a real > derivation when the read class comes into scope. though, that is sort of > hacky as I would have to fudge namespace resolution in deriving clauses. now that GHC has "standalone deriving" syntax, standalone deriving might be something to start from > do other compilers do something clever here? it looks like ghc does what > I do but with CPP tricks and basically writes out its own instances in > full. > > John > > From john at repetae.net Fri Feb 15 21:13:39 2008 From: john at repetae.net (John Meacham) Date: Fri Feb 15 21:12:05 2008 Subject: [jhc] Optimization. In-Reply-To: References: <20080215234502.GE14415@sliver.repetae.net> Message-ID: <20080216021339.GF14415@sliver.repetae.net> On Sat, Feb 16, 2008 at 12:52:30AM +0100, Lemmih wrote: > On Sat, Feb 16, 2008 at 12:45 AM, John Meacham wrote: > > On Fri, Feb 15, 2008 at 03:34:59PM +0100, Lemmih wrote: > > > > > > > The attached patch is a two-step optimization: > > > 1) Only call 'getType' 100,000,000 times instead of 600,000,000 times. > > > 2) Greatly simplify the actual loop. > > > > > > Without this patch, compiling base-1.0.hl takes 19 minutes. With this > > > patch, compile time is down to 11 minutes. > > > Feedback would be greatly appreciated. > > > > I am confused, what in the patch causes getType to be called less? As > > far as I can tell it is just replacing equality checks with 'isFoo' > > predicates. > > E.Values.isCheap checks whether its argument is atomic. The 'isAtomic' > function is very expensive so I moved it down below the four static > checks. Ah, I see now. cool. there are probably other ones in there that can be moved around to improve speed. isAtomic was very cheap once upon a time.. 
but now the isFullyConst was added... It _may_ be possible to have isFullyConst only look one layer deep, since an invariant that is supposed to hold is that a constructor can only hold atomic arguments, hence any Literals under another one must be fully constant. However, I am not sure if that invariant is rigidly enforced through all compilation phases. like, there is a progression of transformations through normal forms, each one more strict than the last... hmm... perhaps a way to test would be for isAtomic to only look a single layer deep but isFullyConst will do a full check. and isAtomic can do a 'isFullyConst' check only when -flint mode is enabled. We will probably have to go through and change some isAtomics to isFullyConst depending on whether the code is known to be in normal form beforehand or not. John -- John Meacham - ?repetae.net?john? From lemmih at gmail.com Sat Feb 16 07:41:07 2008 From: lemmih at gmail.com (Lemmih) Date: Sat Feb 16 07:39:32 2008 Subject: [jhc] Optimization. In-Reply-To: <20080216021339.GF14415@sliver.repetae.net> References: <20080215234502.GE14415@sliver.repetae.net> <20080216021339.GF14415@sliver.repetae.net> Message-ID: On Feb 16, 2008 3:13 AM, John Meacham wrote: > > On Sat, Feb 16, 2008 at 12:52:30AM +0100, Lemmih wrote: > > On Sat, Feb 16, 2008 at 12:45 AM, John Meacham wrote: > > > On Fri, Feb 15, 2008 at 03:34:59PM +0100, Lemmih wrote: > > > > > > > > > > The attached patch is a two-step optimization: > > > > 1) Only call 'getType' 100,000,000 times instead of 600,000,000 times. > > > > 2) Greatly simplify the actual loop. > > > > > > > > Without this patch, compiling base-1.0.hl takes 19 minutes. With this > > > > patch, compile time is down to 11 minutes. > > > > Feedback would be greatly appreciated. > > > > > > I am confused, what in the patch causes getType to be called less? As > > > far as I can tell it is just replacing equality checks with 'isFoo' > > > predicates.
> > > > E.Values.isCheap checks whether its argument is atomic. The 'isAtomic' > > function is very expensive so I moved it down below the four static > > checks. > > Ah, I see now. cool. there are probably other ones in there that can be > moved around to improve speed. No, 'isAtomic'/'isCheap' takes less than a single percent right now. Optimizing it more will be a waste of time. > isAtomic was very cheap once upon a time.. but now the isFullyConst was added... Actually, 'isFullyConst' is insignificant. It was 'getType', called multiple times from 'sortTypeLike' and 'sortKindLike', that took most of the CPU time. > It _may_ be possible to have isFullyConst only look one layer deep, > since an invariant that is supposed to hold is that a constructor can > only hold atomic arguments, hence any Literals under another one must be > fully constant. 'isFullyConst' is deliciously fast. It plays right at GHC's strengths. It compiles to a finely tuned loop that performs essentially no allocations. The code is very pretty and easy to understand; let's keep it that way. > However, I am not sure if that invarient is rigidly enforced through all > compilation phases. like, there is a progression of transformations > through normal forms, each one more strict than the last... hmm... > > perhaps a way to test would be for isAtomic only look a single layer > deep but isFullyConst will do a full check. and isAtomic can do a > 'isFullyConst' check only when -flint mode is enabled. We will probably > have to go thorugh and change some isAtomics to isFullyConst depending > on whether the code is known to be in normal form beforehand or not. The functions 'isAtomic', 'isFullyConst' and 'getType' are ridiculously cheap right now. They only check constructors without doing any allocation. With GHC's pointer tagging, this is as fast as hand tuned C code. Our attention should be diverted elsewhere. About 60% of the CPU time when compiling the base library goes to garbage collecting. 
The majority of garbage is created when converting from IdMap to IdSet and from 'IdMap a' to 'IdMap (Maybe a)'. Fixing these issues should give at least a factor of 2 in performance increase. -- Cheers, Lemmih From lemmih at gmail.com Mon Feb 18 18:36:50 2008 From: lemmih at gmail.com (Lemmih) Date: Mon Feb 18 18:35:07 2008 Subject: [jhc] Re: Optimization. In-Reply-To: References: Message-ID: On Feb 15, 2008 3:34 PM, Lemmih wrote: > Greetings, > > The attached patch is a two-step optimization: > 1) Only call 'getType' 100,000,000 times instead of 600,000,000 times. > 2) Greatly simplify the actual loop. > > Without this patch, compiling base-1.0.hl takes 19 minutes. With this > patch, compile time is down to 11 minutes. > Feedback would be greatly appreciated. Greetings, The attached patch optimizes the allocation of new ids in substitutions. Without this patch, compiling base-1.0.hl takes 11 minutes (down from 19). With this patch, compile time is down to 6 minutes. Feedback would be greatly appreciated. -- Cheers, Lemmih -------------- next part -------------- A non-text attachment was scrubbed... Name: optRoundTwo.dpatch Type: application/octet-stream Size: 10532 bytes Desc: not available Url : http://www.haskell.org/pipermail/jhc/attachments/20080219/28a3f5c5/optRoundTwo.obj From lemmih at gmail.com Tue Feb 19 11:24:49 2008 From: lemmih at gmail.com (Lemmih) Date: Tue Feb 19 11:23:07 2008 Subject: [jhc] Re: Optimization. In-Reply-To: References: Message-ID: Hiya, This patch pushes compile time down below 5 minutes. -- Cheers, Lemmih -------------- next part -------------- A non-text attachment was scrubbed... Name: optRoundThree.dpatch Type: application/octet-stream Size: 10232 bytes Desc: not available Url : http://www.haskell.org/pipermail/jhc/attachments/20080219/618a73d1/optRoundThree-0001.obj From lemmih at gmail.com Tue Feb 19 11:38:00 2008 From: lemmih at gmail.com (Lemmih) Date: Tue Feb 19 11:36:16 2008 Subject: [jhc] Hotspots. 
In-Reply-To: <20080215230101.GB14415@sliver.repetae.net> References: <20080215230101.GB14415@sliver.repetae.net> Message-ID: On Feb 16, 2008 12:01 AM, John Meacham wrote: > On Fri, Feb 15, 2008 at 07:21:55PM +0100, Lemmih wrote: > > Greetings, > > > > I've found a few hotspots that'll be working on. I'd be very > > interested in discussing solutions. > > > > Performance flaws: > > * IdMaps are used to generate new ids. > > * Ho files contain huge amounts of duplicate information. > > * Ho files aren't saved lazily. > > * C code is used for generating atoms. > > > > Repeatedly mapping variables to 'const Nothing' is very expensive. It > > is currently the most expensive procedure in Jhc, taking ~20% CPU time > > when compiling the base library. > > Hmm.. yeah, sometimes I used Maps, sometimes Sets, depending on what I > already have and sometimes it helped to switch between them, sometimes > not. it wasn't always obvious. Ideally all new id selection will be > done in Name.Id as part of the general plan to turn Id into a newtype. > It has the newIds routine, a couple variants of that to work on IdMap > and IdSet would be good. if I am just doing a map (const Nothing) before > passing to the id selection routine then that can probably just be > dropped since the id selection stuff doesn't care about the actual > values in the map. > > The id selection can be finicky, using Set.size to seed the iterations > helped a bunch but I wanted to try a hash function from Id -> Id at some > point as it should reduce the time spent linearly probing for an open > Id. > > > The base library contains 65 megabytes of uncompressed data. Most of > > that is duplicate information that disappears when it is compressed. > > However, parsing that amount of data takes considerable time. > > Which duplicate data in particular is concerning you? I am in the > process of completely reorganizing the Ho file layout so it is probably > best to hold off here. 
Some of the redundancy is there on purpose, but > most probably isn't. Each atom is saved ~100 times. A TVr can contain 50k of data and each TVr is saved ~24 times. -- Cheers, Lemmih From lemmih at gmail.com Tue Feb 19 17:27:48 2008 From: lemmih at gmail.com (Lemmih) Date: Tue Feb 19 17:26:05 2008 Subject: [jhc] Infos. Message-ID: Hi, In Ho/Build.hs, TVr's are stripped of Type and Info information before the Ho files are saved. I experimented with omitting Info information when saving TVr's and everything appeared to work as expected. Is there some corner case where saving the Info list is necessary? -- Cheers, Lemmih From john at repetae.net Tue Feb 19 20:16:02 2008 From: john at repetae.net (John Meacham) Date: Tue Feb 19 20:14:15 2008 Subject: [jhc] Infos. In-Reply-To: References: Message-ID: <20080220011602.GC22804@sliver.repetae.net> On Tue, Feb 19, 2008 at 11:27:48PM +0100, Lemmih wrote: > In Ho/Build.hs, TVr's are stripped of Type and Info information before > the Ho files are saved. I experimented with omitting Info information > when saving TVr's and everything appeared to work as expected. Is > there some corner case where saving the Info list is necessary? you would lose demand (strictness analysis), CPR analysis information, and properties like INLINE and NOINLINE and foreign export declarations. The whole info is not saved to the disk, only a few select fields that are enumerated in Info/Binary.hs are, the rest are automatically discarded by the binary instance. I regularly add or subtract fields to that list as I try out new optimizations or determine whether saving vs regenerating certain information is worth it. John -- John Meacham - ?repetae.net?john? From john at repetae.net Tue Feb 19 20:24:19 2008 From: john at repetae.net (John Meacham) Date: Tue Feb 19 20:22:33 2008 Subject: [jhc] Hotspots.
In-Reply-To: References: <20080215230101.GB14415@sliver.repetae.net> Message-ID: <20080220012419.GD22804@sliver.repetae.net> On Tue, Feb 19, 2008 at 05:38:00PM +0100, Lemmih wrote: > > Which duplicate data in particular is concerning you? I am in the > > process of completely reorganizing the Ho file layout so it is probably > > best to hold off here. Some of the redundancy is there on purpose, but > > most probably isn't. > > Each atom is saved ~100 times. A TVr can contain 50k of data and each > TVr is saved ~24 times. All tvrs except the head ones in the hoEs field should be fully stripped of their auxiliary information. processInitialHo which is run on all ho's read from disk among other things fixes up these references. I should note that every occurrence of a variable having the exact same information as its head is a _strong_ invariant, not just equivalent due to alpha renaming or something, in fact, it should be the exact same heap node but just shared. Incidentally, every now and again I 'promote' something from the Info field to being a first class value of some sort. I am considering re-doing rules in this fashion so a binding will not only have a body but a list of rules. as in, the rules will be associated with the body of a function rather than its tvr. This will have some pervasive effects and hopefully make things nicer, however for simplicity I am probably going to introduce a constraint that only top level values can have attached rules. does anyone think this is too onerous of a restriction? note that SPECIALIZATIONs and CATALYSTs will be under the same restriction. John -- John Meacham - ?repetae.net?john? From lemmih at gmail.com Wed Feb 20 13:42:02 2008 From: lemmih at gmail.com (Lemmih) Date: Wed Feb 20 13:40:26 2008 Subject: [jhc] darcs patch: Cache scoping info, avoid unnecessary maps. (and 2 more) Message-ID: <47bc7480.1836440a.2953.12f9@mx.google.com> Tue Feb 19 16:14:10 CET 2008 Lemmih * Cache scoping info, avoid unnecessary maps.
Tue Feb 19 16:15:54 CET 2008 Lemmih * Remove debug code. Wed Feb 20 19:40:27 CET 2008 Lemmih * Add a comment describing the function of E.Subst. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/x-darcs-patch Size: 9918 bytes Desc: A darcs patch for your repository! Url : http://www.haskell.org/pipermail/jhc/attachments/20080220/295eb766/attachment.bin From lemmih at gmail.com Wed Feb 20 13:49:32 2008 From: lemmih at gmail.com (Lemmih) Date: Wed Feb 20 13:47:53 2008 Subject: [jhc] darcs patch: Makefile wibbles for jhcp and *.hsc Message-ID: <47bc7644.1636440a.3ae8.4812@mx.google.com> Wed Feb 20 19:49:10 CET 2008 Lemmih * Makefile wibbles for jhcp and *.hsc -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/x-darcs-patch Size: 1912 bytes Desc: A darcs patch for your repository! Url : http://www.haskell.org/pipermail/jhc/attachments/20080220/c1798251/attachment-0001.bin From csaba.hruska at gmail.com Wed Feb 20 17:14:41 2008 From: csaba.hruska at gmail.com (Csaba Hruska) Date: Wed Feb 20 17:12:53 2008 Subject: [jhc] jhc backend In-Reply-To: <8914b92d0802201411m6332c4b9xb67b32d2d8e45652@mail.gmail.com> References: <8914b92d0802201411m6332c4b9xb67b32d2d8e45652@mail.gmail.com> Message-ID: <8914b92d0802201414q702f268fnf485af2f93c72ea8@mail.gmail.com> Hi! Will jhc strictly generate only C code? I know the llvm compiler framework (llvm.org), and what do you think of using llvm for the (imperative) backend instead of the current C backend? LLVM can generate C code too and it supports many platforms (X86, IA64, SPARC, ARM, MIPS, SPU, POWERPC) and also handles all platform specific stuff. LLVM uses an intermediate language (a typed RISC-like language), and uses SSA form. Many optimizations can be done via llvm. and llvm supports JIT (Just In Time) compilation, that can be useful for interactive (interpreters) use of haskell. any opinion ?
Cheers, Csaba Hruska P.S.: I'm new here so let me introduce myself: My name is Csaba Hruska, i'm in the 5th year in university (BME budapest, hungary). i'm studying computer science. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.haskell.org/pipermail/jhc/attachments/20080220/c02ce570/attachment.htm From john at repetae.net Wed Feb 20 18:16:06 2008 From: john at repetae.net (John Meacham) Date: Wed Feb 20 18:14:26 2008 Subject: [jhc] jhc backend In-Reply-To: <8914b92d0802201414q702f268fnf485af2f93c72ea8@mail.gmail.com> References: <8914b92d0802201411m6332c4b9xb67b32d2d8e45652@mail.gmail.com> <8914b92d0802201414q702f268fnf485af2f93c72ea8@mail.gmail.com> Message-ID: <20080220231606.GA26731@sliver.repetae.net> On Wed, Feb 20, 2008 at 11:14:41PM +0100, Csaba Hruska wrote: > Hi! > Will jhc strictly generate only c code ? > I know the llvm compiler framework (llvm.org), and what do you think of > using llvm for (imperative) backend instead of current c backend. > LLVM can generate c code too and it supports many platforms (X86, IA64, > SPARC, ARM, MIPS, SPU, POWERPC) and also handles all platform specific > stuff. Yeah, I looked at LLVM, it should actually be quite straightforward to create a back end for it. LLVM might be a particularly well suited back end for jhc since Monadic Grin code has a direct correspondence to SSA form. I have been getting rid of all C specific stuff from the compilation path, for instance, now basic operators are specified as an abstract type in C.Op rather than as mappings to C operators. > LLVM uses an intermediate language (typed RISC-like language), and uses SSA > form. Many optimizations can be done via llvm. > and llvm supports JIT (Just In Time) compilation, that can be useful for > interactive (interpreters) use of haskell. > > any opinion ? Sounds like it could be fun. I am all for jhc having many different back ends. I have most of the framework for a .NET CLR back end implemented as well.
John > P.S.: > I'm new here so let me to introduce: My name is Csaba Hruska, i'm in the 5th > year in university (BME budapest, hungary). i'm studiyng computer science. Welcome! :) -- John Meacham - ?repetae.net?john? From lemmih at gmail.com Thu Feb 21 14:32:47 2008 From: lemmih at gmail.com (Lemmih) Date: Thu Feb 21 14:31:02 2008 Subject: [jhc] Atoms, Infos and unique ids. Message-ID: Hiya, TVr's can be large and are duplicated many times over. We avoid duplicating Info and type information by clearing all variables and restoring them from their binding point. Atoms are very similar to Infos in this regard: they're large enough that we don't want to keep multiple copies around. However, atoms are currently dealt with in an unsafe (see 'getMap') and inefficient manner (10mb (40%) in base-1.0.hl is wasted on duplicate strings). It is my understanding that atoms could be stored exactly like Infos. I tried to implement this but, unfortunately, atoms are frequently being relied on for unique ids. Some cases are easy to fix, others less so. John, do you have time to document the intended behavior in the difficult cases? -- Cheers, Lemmih From john at repetae.net Thu Feb 21 15:09:28 2008 From: john at repetae.net (John Meacham) Date: Thu Feb 21 15:07:36 2008 Subject: [jhc] Atoms, Infos and unique ids. In-Reply-To: References: Message-ID: <20080221200928.GA10754@sliver.repetae.net> On Thu, Feb 21, 2008 at 08:32:47PM +0100, Lemmih wrote: > I tried to implement this but, unfortunately, atoms are frequently > being relied on for unique ids. Some cases are easy to fix, others > less so. > John, do you have time to document the intended behavior in the difficult cases? I am not sure what you mean by difficult cases or what you are trying to fix, atoms are exactly an implementation of the standard atom type in computing as used in prolog, lisp, and the X11 protocol among other things. Uninterpreted strings with a very fast identity operation. 
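A minimal sketch of the atom idea described above (uninterpreted strings with a cheap identity test), assuming a pure Data.Map table rather than the C hash table jhc actually uses. The names Atom, AtomTable, intern and atomString are hypothetical, and lookups here are O(log n), not constant time:

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical sketch; this is not jhc's Atom module.
-- An atom is just a number, so comparing two atoms is an Int comparison.
newtype Atom = Atom Int deriving (Eq, Ord, Show)

-- Interning table: string -> atom, plus the reverse mapping for printing.
data AtomTable = AtomTable
  { byString :: Map.Map String Atom
  , byNumber :: Map.Map Int String
  }

emptyTable :: AtomTable
emptyTable = AtomTable Map.empty Map.empty

-- Intern a string: reuse the existing atom if the string was seen before,
-- otherwise hand out the next odd positive number (named ids are odd).
intern :: String -> AtomTable -> (Atom, AtomTable)
intern s t = case Map.lookup s (byString t) of
  Just a  -> (a, t)
  Nothing ->
    let n = 2 * Map.size (byString t) + 1
        a = Atom n
    in  (a, AtomTable (Map.insert s a (byString t))
                      (Map.insert n s (byNumber t)))

-- Recover the string form of an atom, if it came from this table.
atomString :: Atom -> AtomTable -> Maybe String
atomString (Atom n) t = Map.lookup n (byNumber t)
```

Interning the same string twice yields the same Atom, so equality on identifiers never has to touch the string itself.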
Things got a lot better in terms of space once I made them a custom binary implementation, that saved 7 bytes an atom which is often as long as the string itself. In any case, I would want any solution to be completely independent of the fact that atoms are being used as identifiers in a programming intermediate language. My Atom type is a generally useful library I use other places. I would think this would involve a custom Binary monad that distributed and collected an 'atom environment' of sorts that would then be stored in a different chunk in the file, then whenever one wanted to store/retrieve an atom, they would just add an index into the table. John -- John Meacham - ?repetae.net?john? From lemmih at gmail.com Thu Feb 21 16:03:08 2008 From: lemmih at gmail.com (Lemmih) Date: Thu Feb 21 16:01:20 2008 Subject: [jhc] Atoms, Infos and unique ids. In-Reply-To: <20080221200928.GA10754@sliver.repetae.net> References: <20080221200928.GA10754@sliver.repetae.net> Message-ID: On Thu, Feb 21, 2008 at 9:09 PM, John Meacham wrote: > On Thu, Feb 21, 2008 at 08:32:47PM +0100, Lemmih wrote: > > I tried to implement this but, unfortunately, atoms are frequently > > being relied on for unique ids. Some cases are easy to fix, others > > less so. > > John, do you have time to document the intended behavior in the difficult cases? > > I am not sure what you mean by difficult cases or what you are trying to > fix, atoms are exactly an implementation of the standard atom > type in computing as used in prolog, lisp, and the X11 protocol among > other things. Uninterpreted strings with a very fast identity operation. Atoms in general are fine but they're not the right tool for this particular job. We rarely use atoms for keys (I have the profiling information to back this up) and, when we do, using (hash,string) is ridiculously fast. The costs of using atoms include: broken properties (get . put = id), inefficiencies and obfuscated code.
Given that I have significantly improved the performance of Jhc, I hope it carries some weight when I say that there are NO performance reasons for using atoms, neither in CPU time nor in memory usage. > Things got a lot better in terms of space once I made them a custom > binary implementation, that saved 7 bytes an atom which is often as long > as the string itself. That's like giving ice cream to a kid with tuberculosis: It's a nice thing to do but curing the disease would be better. Those 7 bytes would be completely irrelevant if we only saved unique strings once. There are about 9000 unique strings. A 7 byte overhead would be 63k. We currently waste 10,000k by saving ~100 copies of each unique string. > In any case, I would want any solution to be completely independent of > the fact that atoms are being used as identifiers in a programming > intermediate language. I couldn't agree more. I say, let's take it one step further and assign unique ids to named variables in the same way we do with unnamed variables. Each variable would have a 'Anonymous | Named Name' tag that would be used for pretty-printing. > My Atom type is a generally useful library I use > other places. I would think this would involve a custom Binary monad > that distributed and collected an 'atom environment' of sorts that would > then be stored in a different chunk in the file, then whenever one > wanted to store/retrieve an atom, they would just add an index into the > table. The only other usage I've found was in Stats.hs. And that's definitely not justified. I was referring to cases like DataConstructors.hs and LambdaLift.hs. In DataConstructors.hs it is trivial to assign a new id since it only has to be locally unique. In LambdaLift.hs, on the other hand, I can't tell whether shadowing is OK, whether reusing the old id is OK, or where to get a set of free variables if a unique id is required. Some documentation about what the code tries to do would be very helpful.
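The "save each unique string once" point, and the 'atom environment' table John describes, can be sketched roughly as follows. This is an illustration only, not the Ho file format: buildTable and restore are hypothetical names, and a real implementation would live inside the Binary instance rather than operate on plain lists.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (foldl')

-- Hypothetical sketch: replace every string occurrence by an index into
-- a table of unique strings, so each distinct string is stored only once.
-- Returns (table of unique strings, one index per original occurrence).
buildTable :: [String] -> ([String], [Int])
buildTable occurrences = (reverse uniques, reverse indices)
  where
    (_, uniques, indices) = foldl' step (Map.empty, [], []) occurrences
    step (seen, us, ixs) s = case Map.lookup s seen of
      Just i  -> (seen, us, i : ixs)            -- already in the table
      Nothing ->
        let i = Map.size seen                   -- next free slot
        in  (Map.insert s i seen, s : us, i : ixs)

-- Inverse direction: recover the original occurrences on load.
-- (List indexing is used for brevity; an array would be the sane choice.)
restore :: [String] -> [Int] -> [String]
restore table = map (table !!)
```

With roughly 9000 unique strings, the table is written once and every further occurrence costs only an index, which is the saving the 63k versus 10,000k comparison above is pointing at.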
-- Cheers, Lemmih From john at repetae.net Thu Feb 21 17:23:56 2008 From: john at repetae.net (John Meacham) Date: Thu Feb 21 17:22:04 2008 Subject: [jhc] Atoms, Infos and unique ids. In-Reply-To: References: <20080221200928.GA10754@sliver.repetae.net> Message-ID: <20080221222356.GB10754@sliver.repetae.net> On Thu, Feb 21, 2008 at 10:03:08PM +0100, Lemmih wrote: > Atoms in general are fine but they're not the right tool for this > particular job. We rarely use atoms for keys (I have the profiling > information to back this up) and, when we do, using (hash,string) is > ridiculously fast. The costs of using atoms include: broken properties > (get . put = id), inefficiencies and obfuscated code. Given that I > have significantly improved the performance of Jhc, I hope it carries > some weight when I say that there are NO performance reasons for using > atoms, neither in CPU time or memory usage. Well, the main win is the unboxed Int# inside of TVr and being able to use IdMaps and being able to locally generate unique names based on current ones without referring to and keeping a global table up to date. (see docs/conventions.txt for information on some of the ways unique names are generated from existing ones) Plus, Atoms feel like the aesthetically correct choice to me. Uninterpreted unique strings with fast identity, exactly what I want out of an identifier. Even if they were just newtype Atom = Atom String. (A perfectly valid, if inefficient, implementation) > > Things got a lot better in terms of space once I made them a custom > > binary implementation, that saved 7 bytes an atom which is often as long > > as the string itself. > > That's like giving ice cream to a kid with tuberculosis: It's a nice > thing to do but curing the disease would be better. > > Those 7 bytes would be completely irrelevant if we only saved unique > strings once. There are about 9000 unique strings. A 7 byte overhead > would be 63k.
We currently waste 10,000k by saving ~100 copies of each > unique string. but this seems completely orthogonal to using atoms in general, any system allowing string ids will need to store them somehow. > > In any case, I would want any solution to be completely independent of > > the fact that atoms are being used as identifiers in a programming > > intermediate language. > > I couldn't agree more. I say, let's take it one step further and > assign unique ids to named variables in the same way we do with > unnamed variables. Each variable would have an 'Anonymous | Named Name' > tag that would be used for pretty-printing. We rarely have a list of all names in scope though, and piping one around would be unwieldy. the names of named variables have semantic meaning as well (as described in docs/conventions.txt). Plus, I am wary of non-abstract types in general after some bad growing pains. > > My Atom type is a generally useful library I use > > other places. I would think this would involve a custom Binary monad > > that distributed and collected an 'atom environment' of sorts that would > > then be stored in a different chunk in the file, then whenever one > > wanted to store/retrieve an atom, they would just add an index into the > > table. > > The only other usage I've found was in Stats.hs. And that's definitely > not justified. Again, this is more for aesthetics and clear code intent (and incidental performance gains, no matter how small, are always good). Atoms are what I mean so they are what I use. > I was referring to cases like DataConstructors.hs and LambdaLift.hs. > In DataConstructor.hs it is trivial to assign a new id since it only > has to be locally unique. Yes. If you are referring to the use of [2,4 ..] there, this is one of the last places that the Id abstraction escapes. once I clear that up I can finally fully abstract Id. Yay.
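[Editorial sketch] One way the "[2,4 ..]" pattern could be hidden behind an abstract Id is sketched below. It follows the convention described earlier in the thread (unnamed ids are positive even numbers that only need to avoid the ids currently in scope). The names Id and freshUnnamed are hypothetical and are not taken from the jhc sources.

```haskell
import qualified Data.IntSet as IntSet

-- | Hypothetical abstract identifier. In a real module the constructor
-- would not be exported, so callers could not depend on the Int inside.
newtype Id = Id Int deriving (Eq, Ord, Show)

-- | Pick an unnamed id (a positive even number) that is not in the given
-- in-scope set, and return it together with the extended set.
-- The result is only locally fresh, not globally unique.
freshUnnamed :: IntSet.IntSet -> (Id, IntSet.IntSet)
freshUnnamed inScope = (Id n, IntSet.insert n inScope)
  where
    -- Walk 2, 4, 6, ... until a value outside the in-scope set is found.
    n = until (\i -> not (IntSet.member i inScope)) (+ 2) 2
```

Callers that thread the returned set through successive calls get ids that never shadow each other, without ever writing the even-number enumeration themselves.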
> In LambdaLift.hs, on the other hand, I can't tell whether shadowing is > OK, whether reusing the old id is OK, or where to get a set of free > variables if a unique id is required. Some documentation about what > the code tries to do would be very helpful. I believe the current lambda lifting algorithm requires fully globally unique names for everything. shadowing would not be okay as definitions might be lifted to the same level. it is tricky. I would have to reread it to find out for sure. As with most optimization passes in jhc, it is not my first implementation and I don't expect it to be the last. (for a while I had 4 different strictness analyzers to choose from :) ) At the moment my development tree can't compile anything (without -fno-rules) because I am in the middle of completely rewriting the rules mechanism. (they will no longer be carried in the Info nodes at all) and supercombinators will be logically distinct from let bound values. treating the whole program as a simple mutually recursive let binding is quite elegant, but I have been holding onto that ideal for way too long and have had to go through too many crazy contortions to keep up that unrealistic world view. I already had to have special cases in the type system (top level bindings can have superkind ##, local ones cannot) among other places. So I'd like to hold off on any non-localized changes for the time being. In any case, I don't want to do anything with Ids at the moment other than making them more abstract, I am trying to get 0.5 out the door and I am really uncomfortable making multiple fundamental changes at once, especially to something as tricky as id naming. If you want crazy obfuscated code in pursuit of performance, look at ghc. at least I try to stay mostly haskell 98 (well, haskell-prime beta is actually what I target), avoiding things like explicit unboxed types and trying to keep all strangeness encapsulated behind abstract types with well defined interfaces.
ghc lobs around Int#'s like candy :) BTW. I also have been working on a full honest to goodness manual for jhc. my patches implementing it are just starting to appear in the repository. The rough idea is you can create comments anywhere (or in separate files) that start with {-@Section and contain markdown formatted text and they will be stitched together to form the manual. It is filling out nicely. John -- John Meacham - ?repetae.net?john? From lemmih at gmail.com Thu Feb 21 18:13:10 2008 From: lemmih at gmail.com (Lemmih) Date: Thu Feb 21 18:11:20 2008 Subject: [jhc] Atoms, Infos and unique ids. In-Reply-To: <20080221222356.GB10754@sliver.repetae.net> References: <20080221200928.GA10754@sliver.repetae.net> <20080221222356.GB10754@sliver.repetae.net> Message-ID: On Thu, Feb 21, 2008 at 11:23 PM, John Meacham wrote: > If you want crazy obfuscated code in pursuit of performance, look at > ghc. at least I try to stay mostly haskell 98 (well, haskell-prime beta > is actually what I target). avoiding things like explicit unboxed types > and try to keep all strangeness encapsulated behind abstract types with > well defined interfaces. ghc lobs around Int#'s like candy :) I wanna make a quick response to this segment because I actually feel slightly insulted. I'm trying to get rid of a global mutable state and some C code that segfaults if used incorrectly, and you're accusing me of obfuscating the code? I'm proposing to associate names directly with variables instead of using a magic pointer. It would be the natural thing to do, completely valid Haskell98 code, and several times faster than the current approach. I feel you've made a serious accusation without the evidence to back it up. ): -- Cheers, Lemmih From john at repetae.net Thu Feb 21 18:53:26 2008 From: john at repetae.net (John Meacham) Date: Thu Feb 21 18:51:33 2008 Subject: [jhc] Atoms, Infos and unique ids. 
In-Reply-To: References: <20080221200928.GA10754@sliver.repetae.net> <20080221222356.GB10754@sliver.repetae.net> Message-ID: <20080221235325.GC10754@sliver.repetae.net> On Fri, Feb 22, 2008 at 12:13:10AM +0100, Lemmih wrote: > On Thu, Feb 21, 2008 at 11:23 PM, John Meacham wrote: > > If you want crazy obfuscated code in pursuit of performance, look at > > ghc. at least I try to stay mostly haskell 98 (well, haskell-prime beta > > is actually what I target). avoiding things like explicit unboxed types > > and try to keep all strangeness encapsulated behind abstract types with > > well defined interfaces. ghc lobs around Int#'s like candy :) > > I wanna make a quick response to this segment because I actually feel > slightly insulted. I'm trying to get rid of a global mutable state and > some C code that segfaults if used incorrectly, and you're accusing me > of obfuscating the code? Oh, no, I didn't mean to imply that at all. I _really_ appreciate the work you are putting in to making jhc faster. what I meant to say is that I know parts of jhc are _already_ obfuscated, but that as far as haskell compilers go, it's not as bad as it could be. :) As in, I endeavour to make sure the APIs are clean and well founded. that is what really matters when it comes to maintainability of code. The Id implementation can be swapped out willy nilly once it is fully abstracted, so worrying about how it actually is implemented seems premature. In the end, once the APIs are stable, it is easy enough to plug in a variety of implementations and just try em all out. That should be true of any component of jhc if I did my design right. > I'm proposing to associate names directly > with variables instead of using a magic pointer. It would be the > natural thing to do, completely valid Haskell98 code, and several > times faster than the current approach. Hmm... are you sure it would be faster?
perhaps I don't fully understand what you want to do, but Atoms were darn fast when I was benchmarking, I could have broken them though. I mean, perhaps the speed benefit isn't that useful for jhc... I use the same atom implementation in C projects but I enjoy that in haskell-land I can hide the implementation behind a newtype to make them fully safe. I heart haskell. John -- John Meacham - ?repetae.net?john? From john at repetae.net Thu Feb 21 19:21:01 2008 From: john at repetae.net (John Meacham) Date: Thu Feb 21 19:19:09 2008 Subject: [jhc] Atoms, Infos and unique ids. In-Reply-To: <20080221235325.GC10754@sliver.repetae.net> References: <20080221200928.GA10754@sliver.repetae.net> <20080221222356.GB10754@sliver.repetae.net> <20080221235325.GC10754@sliver.repetae.net> Message-ID: <20080222002101.GA11794@sliver.repetae.net> On Thu, Feb 21, 2008 at 03:53:26PM -0800, John Meacham wrote: > Oh, no, I didn't mean to imply that at all. I _really_ appreciate the > work you are putting in to making jhc faster. what I meant to say is > that I know parts of jhc are _already_ obfuscated, but that as far as > haskell compilers go, it's not as bad as it could be. :) Oh, I should also say that I mean no disrespect to the GHC developers, the only reason I can use simple strictness annotations instead of explicit unboxed types is their sweet strictness analyzer, the only reason I can use more readable 'do' syntax rather than `thenM` style monadic code is their sweet implementation of classes. GHC is great code and the papers that came out of its implementation were the inspiration for most parts of jhc. :) John -- John Meacham - ?repetae.net?john? From lemmih at gmail.com Thu Feb 21 19:34:32 2008 From: lemmih at gmail.com (Lemmih) Date: Thu Feb 21 19:32:41 2008 Subject: [jhc] Atoms, Infos and unique ids. 
In-Reply-To: <20080221235325.GC10754@sliver.repetae.net> References: <20080221200928.GA10754@sliver.repetae.net> <20080221222356.GB10754@sliver.repetae.net> <20080221235325.GC10754@sliver.repetae.net> Message-ID: On Fri, Feb 22, 2008 at 12:53 AM, John Meacham wrote: > On Fri, Feb 22, 2008 at 12:13:10AM +0100, Lemmih wrote: > > On Thu, Feb 21, 2008 at 11:23 PM, John Meacham wrote: > > > If you want crazy obfuscated code in pursuit of performance, look at > > > ghc. at least I try to stay mostly haskell 98 (well, haskell-prime beta > > > is actually what I target). avoiding things like explicit unboxed types > > > and try to keep all strangeness encapsulated behind abstract types with > > > well defined interfaces. ghc lobs around Int#'s like candy :) > > > > I wanna make a quick response to this segment because I actually feel > > slightly insulted. I'm trying to get rid of a global mutable state and > > some C code that segfaults if used incorrectly, and you're accusing me > > of obfuscating the code? > > Oh, no, I didn't mean to imply that at all. I _really_ appreciate the > work you are putting in to making jhc faster. what I meant to say is > that I know parts of jhc are _already_ obfuscated, but that as far as > haskell compilers go, it's not as bad as it could be. :) Ah, okay. > As in, I endevour to make sure the APIs are clean and well founded. that > is what really matters when it comes to maintainability of code. The Id > implementation can be swapped out willy nilly once it is fully > abstracted, so worrying about how it actually is implemented seems > premature. In the end, once the APIs are stable, it is easy enough to > plug in a variety of implementations and just try em all out. That > should be true of any component of jhc if I did my design right. > > > > I'm proposing to associate names directly > > with variables instead of using a magic pointer. 
It would be the > > natural thing to do, completely valid Haskell98 code, and several > > times faster than the current approach. > > Hmm... are you sure it would be faster? perhaps I don't fully understand > what you want to do, but Atoms were darn fast when I was benchmarking, I > could have broken them though. I mean, perhaps the speed benefit isn't > that useful for jhc... I use the same atom implementation in C projects > but I enjoy that in haskell-land I can hide the implementation behind a > newtype to make them fully safe. I heart haskell. The problem is not atoms per se; it's generating ids from a global store. When saving TVr's with associated names, the atom has to be saved instead of just the id. Duplicating few strings may not sound too serious but it does take its toll. It is relatively minor (but still significant) when compiling base-1.0.hl (7% cpu, 8% memory usage). However, it is dominating when compiling smaller pieces of code such as HelloWorld. -- Cheers, Lemmih From john at repetae.net Thu Feb 21 19:56:07 2008 From: john at repetae.net (John Meacham) Date: Thu Feb 21 19:54:14 2008 Subject: [jhc] Atoms, Infos and unique ids. In-Reply-To: References: <20080221200928.GA10754@sliver.repetae.net> <20080221222356.GB10754@sliver.repetae.net> <20080221235325.GC10754@sliver.repetae.net> Message-ID: <20080222005607.GB11794@sliver.repetae.net> On Fri, Feb 22, 2008 at 01:34:32AM +0100, Lemmih wrote: > > Hmm... are you sure it would be faster? perhaps I don't fully understand > > what you want to do, but Atoms were darn fast when I was benchmarking, I > > could have broken them though. I mean, perhaps the speed benefit isn't > > that useful for jhc... I use the same atom implementation in C projects > > but I enjoy that in haskell-land I can hide the implementation behind a > > newtype to make them fully safe. I heart haskell. > > The problem is not atoms per se; it's generating ids from a global > store. 
When saving TVr's with associated names, the atom has to be > saved instead of just the id. Duplicating few strings may not sound > too serious but it does take its toll. It is relatively minor (but > still significant) when compiling base-1.0.hl (7% cpu, 8% memory > usage). However, it is dominating when compiling smaller pieces of > code such as HelloWorld. Oh, yes. I agree this needs to be fixed. but it seems orthogonal to the issue of using atoms in the first place, as in. Any solution that works for something that uses 'String' will work for 'Atom' and vice versa. Something to try might be to rename everything other than top level names to numeric ids right before writing out the ho file. Top level names need to be globally unique, across time and space, as in, things will break if two independent modules chose the same top level name for something and then imported each other, but presumably by the time you write out the ho you have lambda lifted most stuff out and the debugging value of preserving them is less. Though, you have not converted things to supercombinators which will introduce more global names. (I actually should look into whether doing this before writing ho files is useful). In any case. it might be a simple thing to test out. Otherwise I think the idea of a Binary type that maintains an environment would be good, as we might find other uses for it besides jhc identifiers. John -- John Meacham - ?repetae.net?john? From lemmih at gmail.com Thu Feb 21 20:23:35 2008 From: lemmih at gmail.com (Lemmih) Date: Thu Feb 21 20:21:44 2008 Subject: [jhc] Atoms, Infos and unique ids. 
In-Reply-To: <20080222005607.GB11794@sliver.repetae.net> References: <20080221200928.GA10754@sliver.repetae.net> <20080221222356.GB10754@sliver.repetae.net> <20080221235325.GC10754@sliver.repetae.net> <20080222005607.GB11794@sliver.repetae.net> Message-ID: On Fri, Feb 22, 2008 at 1:56 AM, John Meacham wrote: > On Fri, Feb 22, 2008 at 01:34:32AM +0100, Lemmih wrote: > > > Hmm... are you sure it would be faster? perhaps I don't fully understand > > > what you want to do, but Atoms were darn fast when I was benchmarking, I > > > could have broken them though. I mean, perhaps the speed benefit isn't > > > that useful for jhc... I use the same atom implementation in C projects > > > but I enjoy that in haskell-land I can hide the implementation behind a > > > newtype to make them fully safe. I heart haskell. > > > > The problem is not atoms per se; it's generating ids from a global > > store. When saving TVr's with associated names, the atom has to be > > saved instead of just the id. Duplicating few strings may not sound > > too serious but it does take its toll. It is relatively minor (but > > still significant) when compiling base-1.0.hl (7% cpu, 8% memory > > usage). However, it is dominating when compiling smaller pieces of > > code such as HelloWorld. > > Oh, yes. I agree this needs to be fixed. but it seems orthogonal to the > issue of using atoms in the first place, as in. Any solution that works > for something that uses 'String' will work for 'Atom' and vice versa. Yes, sorry for not being clear. It's the pointers to atoms, and not atoms themselves, that are causing trouble. > Something to try might be to rename everything other than top level > names to numeric ids right before writing out the ho file.
Top level > names need to be globally unique, across time and space, as in, things > will break if two independent modules chose the same top level name for > something and then imported each other, but presumably by the time you > write out the ho you have lambda lifted most stuff out and the debugging > value of preserving them is less. > > Though, you have not converted things to supercombinators which will > introduce more global names. (I actually should look into whether doing > this before writing ho files is useful). > > In any case. it might be a simple thing to test out. > > Otherwise I think the idea of a Binary type that maintains an > environment would be good, as we might find other uses for it besides > jhc identifiers. I'm a fan of assigning locally unique ids as necessary since we already do this on a per-function level. -- Cheers, Lemmih From lemmih at gmail.com Thu Feb 21 21:19:19 2008 From: lemmih at gmail.com (Lemmih) Date: Thu Feb 21 21:17:31 2008 Subject: [jhc] darcs patch: Use an IdMap in E.Demand. This gives an 18% speed-up a... Message-ID: <47be3129.2135440a.49bd.ffffceb8@mx.google.com> Fri Feb 22 03:14:45 CET 2008 Lemmih * Use an IdMap in E.Demand. This gives an 18% speed-up and smaller ho files. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/x-darcs-patch Size: 5546 bytes Desc: A darcs patch for your repository! Url : http://www.haskell.org/pipermail/jhc/attachments/20080222/52e0f877/attachment.bin From john at repetae.net Thu Feb 21 22:00:21 2008 From: john at repetae.net (John Meacham) Date: Thu Feb 21 21:58:28 2008 Subject: [jhc] darcs patch: Use an IdMap in E.Demand. This gives an 18% speed-up a... In-Reply-To: <47be3129.2135440a.49bd.ffffceb8@mx.google.com> References: <47be3129.2135440a.49bd.ffffceb8@mx.google.com> Message-ID: <20080222030020.GC11794@sliver.repetae.net> sweet. That IntMap implementation sure is nice compared to Map. go go radix trees!
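[Editorial sketch] For readers unfamiliar with the idea behind the patch, an IdMap can be pictured as a thin newtype over Data.IntMap, which the containers library implements as a big-endian Patricia (radix) tree. The code below is a minimal sketch and not the contents of the patch or of jhc's real IdMap module. The names Id, IdMap, emptyIdMap, insertId, lookupId and idMapSize are hypothetical.

```haskell
import qualified Data.IntMap as IntMap

-- | Hypothetical stand-in for jhc's identifier type.
newtype Id = Id Int deriving (Eq, Ord, Show)

-- | Hypothetical IdMap: an IntMap keyed by the Int inside an Id.
-- Wrapping it in a newtype keeps plain Ints from being used as keys
-- by accident.
newtype IdMap a = IdMap (IntMap.IntMap a) deriving (Eq, Show)

-- | The map with no bindings.
emptyIdMap :: IdMap a
emptyIdMap = IdMap IntMap.empty

-- | Bind an id to a value, replacing any previous binding.
insertId :: Id -> a -> IdMap a -> IdMap a
insertId (Id k) v (IdMap m) = IdMap (IntMap.insert k v m)

-- | Look up the value bound to an id, if any.
lookupId :: Id -> IdMap a -> Maybe a
lookupId (Id k) (IdMap m) = IntMap.lookup k m

-- | Number of bindings in the map.
idMapSize :: IdMap a -> Int
idMapSize (IdMap m) = IntMap.size m
```

Compared with a Data.Map keyed on a boxed, Ord-compared key, lookups here branch on the bits of the Int key, which is the kind of difference the reply above alludes to. The 18% figure belongs to the patch itself and is not something this sketch demonstrates.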
John -- John Meacham - ?repetae.net?john? From lemmih at gmail.com Fri Feb 22 07:00:59 2008 From: lemmih at gmail.com (Lemmih) Date: Fri Feb 22 06:59:13 2008 Subject: [jhc] darcs patch: Cache a few items, this gives a 20% speed bump. Message-ID: <47beb97f.1836440a.3ff9.792c@mx.google.com> Fri Feb 22 12:59:49 CET 2008 Lemmih * Cache a few items, this gives a 20% speed bump. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/x-darcs-patch Size: 4599 bytes Desc: A darcs patch for your repository! Url : http://www.haskell.org/pipermail/jhc/attachments/20080222/8a3d512d/attachment-0001.bin From lemmih at gmail.com Fri Feb 22 16:33:07 2008 From: lemmih at gmail.com (Lemmih) Date: Fri Feb 22 16:31:12 2008 Subject: [jhc] Threading and atoms. Message-ID: Hiya, When I use -threaded and -N2 in jhc, atoms fail in spectacular ways. So far I've seen: >>> internal error: fullCheck': Instance@.iJhc.Monad.return.Jhc.Prim.IO x6598 Strong': ? x6598 Cannot strong: (?, [x6598]) and: stringlib: error alocating memory Are atoms in some way not thread-safe enough? -- Cheers, Lemmih From john at repetae.net Fri Feb 22 16:45:53 2008 From: john at repetae.net (John Meacham) Date: Fri Feb 22 16:43:57 2008 Subject: [jhc] Threading and atoms. In-Reply-To: References: Message-ID: <20080222214553.GB18207@sliver.repetae.net> On Fri, Feb 22, 2008 at 10:33:07PM +0100, Lemmih wrote: > When I use -threaded and -N2 in jhc, atoms fail in spectacular ways. > So far I've seen: > >>> internal error: > fullCheck': Instance@.iJhc.Monad.return.Jhc.Prim.IO x6598 > Strong': ? x6598 > Cannot strong: > (?, [x6598]) This is the RULES thing I was telling you about earlier. you need to compile -fno-rules for now or unpull my recent patches. (expect things to be slow without rules) > and: > stringlib: error alocating memory > > > Are atoms in some way not thread-safe enough? 
and you need to #define USE_THREADS to 1 in StringTable/StringTable_cbits.c John -- John Meacham - ?repetae.net?john? From lemmih at gmail.com Fri Feb 22 17:29:41 2008 From: lemmih at gmail.com (Lemmih) Date: Fri Feb 22 17:27:49 2008 Subject: [jhc] Threading and atoms. In-Reply-To: <20080222214553.GB18207@sliver.repetae.net> References: <20080222214553.GB18207@sliver.repetae.net> Message-ID: On Fri, Feb 22, 2008 at 10:45 PM, John Meacham wrote: > On Fri, Feb 22, 2008 at 10:33:07PM +0100, Lemmih wrote: > > When I use -threaded and -N2 in jhc, atoms fail in spectacular ways. > > So far I've seen: > > >>> internal error: > > fullCheck': Instance@.iJhc.Monad.return.Jhc.Prim.IO x6598 > > Strong': ? x6598 > > Cannot strong: > > (?, [x6598]) > > This is the RULES thing I was telling you about earlier. you need to > compile -fno-rules for now or unpull my recent patches. (expect things to be slow without rules) Most of what you were telling me went over my head, I'm afraid. (: > > and: > > stringlib: error alocating memory > > > > > > Are atoms in some way not thread-safe enough? > > and you need to #define USE_THREADS to 1 in > StringTable/StringTable_cbits.c Ah, good. -- Cheers, Lemmih From lemmih at gmail.com Fri Feb 22 17:47:39 2008 From: lemmih at gmail.com (Lemmih) Date: Fri Feb 22 17:45:45 2008 Subject: [jhc] Threading and atoms. In-Reply-To: References: <20080222214553.GB18207@sliver.repetae.net> Message-ID: Out of interest, how far away is Jhc from compiling itself? -- Cheers, Lemmih From john at repetae.net Fri Feb 22 17:59:31 2008 From: john at repetae.net (John Meacham) Date: Fri Feb 22 17:57:35 2008 Subject: [jhc] Threading and atoms. In-Reply-To: References: <20080222214553.GB18207@sliver.repetae.net> Message-ID: <20080222225931.GC18207@sliver.repetae.net> On Fri, Feb 22, 2008 at 11:47:39PM +0100, Lemmih wrote: > Out of interest, how far away is Jhc from compiling itself? Pretty far, but it gets closer with each patch. 
:) there are a couple extensions that need to be implemented before it can try. pattern guards and MPTCs in particular. But I sort of put a self-imposed restriction on myself that I won't start adding new language extensions until all of 'nobench' compiles cleanly. -- John Meacham - ?repetae.net?john? From lemmih at gmail.com Sat Feb 23 09:02:02 2008 From: lemmih at gmail.com (Lemmih) Date: Sat Feb 23 09:00:10 2008 Subject: [jhc] darcs patch: Fix bug in E.Subst. Message-ID: <47c0275e.2534440a.0ca6.3936@mx.google.com> Sat Feb 23 15:00:47 CET 2008 Lemmih * Fix bug in E.Subst. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/x-darcs-patch Size: 1266 bytes Desc: A darcs patch for your repository! Url : http://www.haskell.org/pipermail/jhc/attachments/20080223/cf73c57f/attachment.bin From lemmih at gmail.com Sat Feb 23 09:52:29 2008 From: lemmih at gmail.com (Lemmih) Date: Sat Feb 23 09:50:34 2008 Subject: [jhc] Substitutions. Message-ID: Hiya, I've encountered something I don't understand in E.Subst. The substitution routine is very eager to inline stuff. It inlines all the simple applications it can find. Consider the following example: (\a -> a+a) expensive The substitution routine will inline that to: expensive+expensive Wouldn't it be prudent to generate this instead: let a = expensive in a + a Whether to inline it further would be decided later on. -- Cheers, Lemmih From lemmih at gmail.com Sat Feb 23 09:54:39 2008 From: lemmih at gmail.com (Lemmih) Date: Sat Feb 23 09:53:28 2008 Subject: [jhc] Re: Substitutions. In-Reply-To: References: Message-ID: On Sat, Feb 23, 2008 at 3:52 PM, Lemmih wrote: > Hiya, > > I've encountered something I don't understand in E.Subst. > The substitution routine is very eager to inline stuff. It inlines all > the simple applications it can find. 
> Consider the following example: > (\a -> a+a) expensive > The substitution routine will inline that to: > expensive+expensive > Wouldn't it be prudent to generate this instead: > let a = expensive in a + a > > Whether to inline it further would be decided later on. Actually, it might be wise not to do any optimizations in the substitution routine. I assume constant applications are handled elsewhere as well? -- Cheers, Lemmih From naesten at gmail.com Sat Feb 23 13:08:12 2008 From: naesten at gmail.com (Samuel Bronson) Date: Sat Feb 23 13:06:14 2008 Subject: [jhc] Threading and atoms. In-Reply-To: <20080222225931.GC18207@sliver.repetae.net> References: <20080222214553.GB18207@sliver.repetae.net> <20080222225931.GC18207@sliver.repetae.net> Message-ID: On Fri, Feb 22, 2008 at 5:59 PM, John Meacham wrote: > On Fri, Feb 22, 2008 at 11:47:39PM +0100, Lemmih wrote: > > Out of interest, how far away is Jhc from compiling itself? > > Pretty far, but it gets closer with each patch. :) > > there are a couple extensions that need to be implemented before it can > try. pattern guards and MPTCs in particular. But I sort of put a > self-imposed restriction on myself that I won't start adding new > language extensions until all of 'nobench' compiles cleanly. That will probably help it to happen faster anyway. From lemmih at gmail.com Sat Feb 23 14:46:51 2008 From: lemmih at gmail.com (Lemmih) Date: Sat Feb 23 14:44:55 2008 Subject: [jhc] Occurance collection. Message-ID: Hiya, In E.SSimplify.collectOccurance the 'arg' function is used quite often and I can't tell why. The function clears usage information for free variables and is primarily used for types. It would be a great help if you'd add a short description to collectOccurance and collectDs. Any information about invariants and/or what they're trying to do would be appreciated. 
-- Cheers, Lemmih From john at repetae.net Sat Feb 23 18:13:21 2008 From: john at repetae.net (John Meacham) Date: Sat Feb 23 18:11:22 2008 Subject: [jhc] Substitutions. In-Reply-To: References: Message-ID: <20080223231321.GA23373@sliver.repetae.net> On Sat, Feb 23, 2008 at 03:52:29PM +0100, Lemmih wrote: > I've encountered something I don't understand in E.Subst. > The substitution routine is very eager to inline stuff. It inlines all > the simple applications it can find. > Consider the following example: > (\a -> a+a) expensive > The substitution routine will inline that to: > expensive+expensive > Wouldn't it be prudent to generate this instead: > let a = expensive in a + a An invariant is that the only things that may ever appear as arguments to functions or data constructors are atoms. This ensures that beta reduction is always beneficial and that 'let' statements are the one and only way to allocate thunks. GHC core has the same restriction for the same reason, it makes a whole lot of transformations a lot simpler. John -- John Meacham - ?repetae.net?john? From john at repetae.net Sat Feb 23 18:19:14 2008 From: john at repetae.net (John Meacham) Date: Sat Feb 23 18:17:16 2008 Subject: [jhc] Re: Substitutions. In-Reply-To: References: Message-ID: <20080223231914.GB23373@sliver.repetae.net> On Sat, Feb 23, 2008 at 03:54:39PM +0100, Lemmih wrote: > Actually, it might be wise not to do any optimizations in the > substitution routine. I assume constant applications are handled > elsewhere as well? The atom invariant ensures that beta reduction is a simple source transformation that does not change the behavior of the program, not an optimization. E normal form number 2 (a name I just made up :) ) says all arguments must be atomic, all applied things may only be simple variables or another application, and lambda expressions may only occur directly on the RHS of a let binding, the body of a let statement, or in a case branch body.
normal form 3 (after lambda lifting) says lambda expressions may _only_ occur at the top level, nowhere else. John -- John Meacham - ?repetae.net?john? From lemmih at gmail.com Sat Feb 23 18:45:55 2008 From: lemmih at gmail.com (Lemmih) Date: Sat Feb 23 18:43:57 2008 Subject: [jhc] Re: Substitutions. In-Reply-To: <20080223231914.GB23373@sliver.repetae.net> References: <20080223231914.GB23373@sliver.repetae.net> Message-ID: On Sun, Feb 24, 2008 at 12:19 AM, John Meacham wrote: > On Sat, Feb 23, 2008 at 03:54:39PM +0100, Lemmih wrote: > > Actually, it might be wise not to do any optimizations in the > > substitution routine. I assume constant applications are handled > > elsewhere as well? > > The atom invariant ensures that beta reduction is a simple source > transformation that does not change the behavior of the program, not an > optimization. > > E normal form number 2 (a name I just made up :) ) says > all arguments must be atomic, all applied things may only be simple > variables or another application, and lambda expressions may only occur > directly on the RHS of a let binding, the body of a let statement, or in > a case branch body. > > normal form 3 (after lambda lifting) says lambda expressions may _only_ > occur at the top level, nowhere else. Excellent, exactly what I needed to fix my test case. -- Cheers, Lemmih From john at repetae.net Mon Feb 25 23:43:20 2008 From: john at repetae.net (John Meacham) Date: Mon Feb 25 23:41:13 2008 Subject: [jhc] Occurance collection. In-Reply-To: References: Message-ID: <20080226044320.GA5260@sliver.repetae.net> On Sat, Feb 23, 2008 at 08:46:51PM +0100, Lemmih wrote: > In E.SSimplify.collectOccurance the 'arg' function is used quite often > and I can't tell why. The function clears usage information for free > variables and is primarily used for types. > It would be a great help if you'd add a short description to > collectOccurance and collectDs.
Any information about invariants > and/or what they're trying to do would be appreciated. Basically, occurrence collection figures out how many times and in what way variables are used. it figures out things like whether a variable has been used at most once and whether it is in a lambda. If a variable ever appears as an argument to another function, you can't know anything (at least not in this analysis) about how it is used, so the function 'arg' tells it that all the variables in its argument should be considered used an unknown number of times. likewise, types get passed around a lot, and better usage information for them doesn't usually help because a lot of them (but not all) are erased by run time, so we just take the conservative approach and treat them like they were used as an argument to a function, used an unknown number of times. John -- John Meacham - ?repetae.net?john? From naesten at gmail.com Wed Feb 27 15:38:04 2008 From: naesten at gmail.com (Samuel Bronson) Date: Wed Feb 27 15:35:56 2008 Subject: [jhc] Re: Substitutions. In-Reply-To: <20080223231914.GB23373@sliver.repetae.net> References: <20080223231914.GB23373@sliver.repetae.net> Message-ID: On 2/23/08, John Meacham wrote: > On Sat, Feb 23, 2008 at 03:54:39PM +0100, Lemmih wrote: > > Actually, it might be wise not to do any optimizations in the > > substitution routine. I assume constant applications are handled > > elsewhere as well? > > The atom invariant ensures that beta reduction is a simple source > transformation that does not change the behavior of the program, not an > optimization. > > E normal form number 2 (a name I just made up :) ) says > all arguments must be atomic, all applied things may only be simple > variables or another application, and lambda expressions may only occur > directly on the RHS of a let binding, the body of a let statement, or in > a case branch body.
> > normal form 3 (after lambda lifting) says lambda expressions may _only_ > occur at the top level, nowhere else. Gee, maybe this should be documented somewhere?
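[Editorial sketch] As a starting point for such documentation, the "normal form 2" rules quoted above can be written down as a small checker over a toy expression type. This is a sketch only: the type E below is a hypothetical simplification and not jhc's real E data type, and the rules are read literally, so a lambda is rejected anywhere other than a let right-hand side, a let body, or a case branch body.

```haskell
-- | Toy expression language (hypothetical, much simpler than jhc's E).
data E
  = Var String
  | Lit Int
  | App E E                 -- function applied to one argument
  | Lam [String] E          -- lambda with one or more parameters
  | Let [(String, E)] E     -- bindings and body
  | Case E [(String, E)]    -- scrutinee and (pattern name, branch body)
  deriving (Eq, Show)

-- | Atoms are the only things allowed in argument position.
isAtom :: E -> Bool
isAtom (Var _) = True
isAtom (Lit _) = True
isAtom _       = False

-- | An application spine: the applied thing must be a variable or another
-- application, and every argument along the way must be atomic.
spineOk :: E -> Bool
spineOk (Var _)   = True
spineOk (App f a) = isAtom a && spineOk f
spineOk _         = False

-- | Check the literal reading of normal form 2 for a whole expression.
inNormalForm2 :: E -> Bool
inNormalForm2 = go False
  where
    -- The Bool says whether a lambda may appear directly at this position.
    go _     (Var _)       = True
    go _     (Lit _)       = True
    go _     e@(App _ _)   = spineOk e
    go lamOk (Lam _ body)  = lamOk && go False body
    go _     (Let bs body) = all (go True . snd) bs && go True body
    go _     (Case s alts) = go False s && all (go True . snd) alts
```

Under this reading, f (g x) is rejected because the argument is not atomic, while let a = g x in f a is accepted, which matches the point that 'let' is the only way to allocate a thunk.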